Author Topic: A skeleton code for Text Scroller via Drag-and-Drop  (Read 1249 times)


Offline Sanmayce

  • Newbie
  • Posts: 18
  • Where is that English Text Sidekick?
    • Sanmayce's home
Re: A skeleton code for Text Scroller via Drag-and-Drop
« Reply #30 on: January 27, 2021, 05:08:41 AM »
 

Steve, a ton indeed, but what took you so long? Heh-heh.

Without exploiting a strstr()-like function (INSTR() here), building these strings is a lost cause, I knew that, yet I wanted to see how QB64 behaves.

Last night I had a few free hours and came up with a new parser, which I liked instantly, mainly because it required minimal effort/changes (in a lazy manner I converted the string array to ... a string function).

 


It is maybe 30x slower than Steve's (150 seconds), however it was written with Wikipedia in mind, and it has a sub-variant (commented out for now) not needing the file to be in RAM.

The benefits:

- the parsing is now done in a single pass;
- the simulated "string" now has a field housing its length, so LEN() is unnecessary;
- for the sub-variant not loading the file into physical memory, the RAM footprint is tiny - only (8-byte offset + 8-byte length) x lines, i.e. 16 bytes x ~4M lines = 64+ MB (in fact 133 MB) for OED.DSL with 4,071,706 lines.

Since my wish is to browse files as big as 'enwiki-20210101-pages-articles.xml', 75.6 GB (81,193,612,108 bytes), on machines with 32 GB, the current parser fits in: this XML dump requires 18 GB (1,218,205,075 lines x 16 bytes).
I chose 8 bytes to house the maximal line length, since there are (unwrapped) DNA sequences exceeding 4 billion characters.
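To make the 16-bytes-per-line idea concrete, here is a minimal C sketch (an illustration under my own assumptions, not Masakari's actual QB64 code) of a one-pass scan that records a 64-bit offset and a 64-bit length for every line of an in-memory buffer; the `LineIndex` name and growth strategy are mine:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* One-pass line indexing: for every LF-terminated line record a
   64-bit offset and a 64-bit length, i.e. 16 bytes of index per
   line. Illustrative sketch only -- not Masakari's QB64 code. */
typedef struct {
    int64_t *off;   /* start offset of each line             */
    int64_t *len;   /* length of each line, LF not included  */
    int64_t  lines; /* number of LF-terminated lines found   */
} LineIndex;

static LineIndex index_lines(const char *buf, int64_t size) {
    LineIndex ix;
    int64_t cap = 16, n = 0, start = 0, i;
    ix.off = malloc((size_t)cap * sizeof *ix.off);
    ix.len = malloc((size_t)cap * sizeof *ix.len);
    for (i = 0; i < size; i++) {
        if (buf[i] == '\n') {
            if (n == cap) {   /* grow both arrays together */
                cap *= 2;
                ix.off = realloc(ix.off, (size_t)cap * sizeof *ix.off);
                ix.len = realloc(ix.len, (size_t)cap * sizeof *ix.len);
            }
            ix.off[n] = start;
            ix.len[n] = i - start;   /* the LF itself is stripped */
            n++;
            start = i + 1;
        }
    }
    ix.lines = n;   /* a trailing fragment without LF is not counted */
    return ix;
}
```

With such an index, fetching line i is just a seek to off[i] and a read of len[i] bytes, and no LEN()-style scan is ever needed.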

Code: [Select]
F:\_KAZE_CPU_Benchmark_Fuzzy-Search-Wikipedia_OCN>dir

01/17/2021  09:12    81,193,612,108 enwiki-20210101-pages-articles.xml
02/24/2018  12:23            69,120 LineWordreporter.exe
01/17/2021  05:16            52,761 LineWordreporter_r1stats.zip

F:\_KAZE_CPU_Benchmark_Fuzzy-Search-Wikipedia_OCN>dir enwiki-20210101-pages-articles.xml/b 1>q

F:\_KAZE_CPU_Benchmark_Fuzzy-Search-Wikipedia_OCN>LineWordreporter.exe q
LineWordreporter, revision 1_stats, written by Kaze.
Purpose: Reports number of lines(LFs) and words in files from a given filelist.
Example:
D:\>LineWordreporter.exe LQ2048.lst
Note1: Files can exceed 4GB limit.
Note2: For CRLF ending lines i.e. Windows style you must add -1.
Buffered counting ...
Reading enwiki-20210101-pages-articles.xml ...
Read (total) bytes so far: 81,193,612,108
LineWordreporter: Encountered lines in all files: 1,218,205,075
LineWordreporter: Encountered words in all files: 10,762,241,151
LineWordreporter: Longest line: 1,756,520
LineWordreporter: Longest word: 104,895

F:\_KAZE_CPU_Benchmark_Fuzzy-Search-Wikipedia_OCN>
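LineWordreporter itself is not published here, but the buffered counting it reports above can be sketched in C along these lines (my own hedged reimplementation of the idea, not the author's tool; treating a "word" as a maximal run of alphanumerics is an assumption):

```c
#include <assert.h>
#include <ctype.h>
#include <stdint.h>
#include <stdio.h>

/* Buffered LF and word counting in the spirit of the report above.
   Hedged sketch, NOT the LineWordreporter tool itself; here a
   "word" is assumed to be a maximal run of alphanumerics. */
typedef struct { int64_t lines, words; } Counts;

static Counts count_stream(FILE *f) {
    Counts c = {0, 0};
    unsigned char buf[1 << 16];   /* 64 KB read buffer */
    size_t got, i;
    int in_word = 0;
    while ((got = fread(buf, 1, sizeof buf, f)) > 0) {
        for (i = 0; i < got; i++) {
            if (buf[i] == '\n') c.lines++;       /* count LFs only */
            if (isalnum(buf[i])) {
                if (!in_word) { c.words++; in_word = 1; }
            } else {
                in_word = 0;
            }
        }
    }
    return c;
}
```

The 64-bit counters matter: as the note in the transcript says, files can exceed the 4 GB limit, so 32-bit totals would overflow.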

Will play more when I have time...

I did the same thing but wrote it in assembler. It saves all the file names with full paths in a file for later processing. It walks the files on any USB stick or hard drive, and it is fast.

I salute you; a "trivial" task done well is what I enjoy.
He learns not to learn and reverts to what all men pass by.

Offline bplus

  • Forum Resident
  • Posts: 6241
  • What could possibly go wrong?
Re: A skeleton code for Text Scroller via Drag-and-Drop
« Reply #31 on: January 27, 2021, 12:58:49 PM »
This project reminds me of Text Fetch:
https://www.qb64.org/forum/index.php?topic=1876.msg111064#msg111064

and InForm version:
https://www.qb64.org/forum/index.php?topic=1874.msg111060#msg111060

No drag and drop; you just navigate your hard drive's files and folders, load files into the reader, and select text for clipboard copy/paste.

Ha! Looks like I used a slow file-loading method; I could update it now for Linux users, but there wasn't much interest when it was posted.


So far with Masakari, I've only managed to drag and drop one file into it. Do I need more instruction? It seems to be done after displaying one file, but I did figure out a trick for drag and drop: from Windows Explorer, drag to the icon on the toolbar, which pops up Masakari, and then I can drag/drop into it.

What is Masakari anyway?
Quote
Dictionary source: Medieval Glossary. More: English to English translation of masakari. Masakari (masa-kari) is the Japanese word for an "axe" or a "hatchet", and is used to describe various tools of similar structure. As with axes in other cultures, they are sometimes employed as weapons.

Ah! :)
« Last Edit: January 27, 2021, 01:17:14 PM by bplus »

Offline Sanmayce

  • Newbie
  • Posts: 18
  • Where is that English Text Sidekick?
    • Sanmayce's home
Re: A skeleton code for Text Scroller via Drag-and-Drop
« Reply #32 on: January 28, 2021, 05:13:15 AM »
Quote
What is Masakari anyway?

Literally, a broadaxe; metaphorically, a powerful tool/weapon to cut through evil/ignorance. I have made this booklet: http://www.sanmayce.com/MSKR/masakari_3.pdf

Also, as a big fan of Mech Commander, I loved those battle mechs; the most awesome one was Masakari - the brutal destroyer in the field.

Here comes Masakari r.5 ...   


 

Glad to share the first complete revision, featuring:

- Ability to load a filelist (a file containing filenames, with or without paths): press SPACE and the selected file will be loaded in the same window;
- Ability to browse files bigger than the available RAM;
- The parser is 64-bit, i.e. 4+ GB files can be loaded in fast mode;
- Two modes: fast (memory used is filesize + lines x 16 bytes) and efficient (smaller memory footprint, just lines x 16 bytes);
- Built-in benchmark #1: LShift+RShift - reports (in the status line, in red) the time to load;
- Built-in benchmark #2: LCtrl+RCtrl - reports (in the status line, in red) the time for PgDn-ing through the entire file;

On my laptop (i5-7200U @ 2.5 GHz, 36 GB DDR4-2133, Windows 10, Samsung 860 PRO 256 GB SATA 3), the current r.5 loaded the Oxford English Dictionary in 96 seconds in its fast mode (i.e. ToLoadOrNotFlag = 1).
In its efficient mode (i.e. ToLoadOrNotFlag = 0) the load time was 3,342 seconds.

However, Task Manager shows Memory Usage: 1248 MB for the fast one and 141 MB for the efficient one.

 


Benchmark #2 simulates holding the PgDn key until the end of the file is reached. In fast mode it took 1162 seconds, or 4,071,706 lines / 60 lines per page / 1162 seconds = 58 pages per second.
In efficient mode it took 1241 seconds, or 4,071,706 lines / 60 lines per page / 1241 seconds = 54 pages per second.

 


In the coming days I will run the Wikipedia benchmark, i.e. loading the 76 GB in fast mode; I am curious to see how much time it will take.

I want to use this revision for a while to see what else could be refined and simplified. I need a quick browser for the Linux and Windows command prompts: two keys to browse all files in the current folder and its branches, plus hitting Space to load the desired file for browsing - three keys in total:
1] m
2] Enter
[mouse wheel, or arrows, to choose from the filelist]
3] Space

The first is a .sh file executing yoshi.elf and masakari.elf - the three placed in bin/

Edit: Had to fix one bug, namely:
Code: QB64: [Select]
        'Ugh, buggy: in r.5 the two "malloc" lines below should use 'NumberOfLFs', not 'FileSize'...
        MhandleOFF = _MEMNEW(8&& * (NumberOfLFs + 1)) 'create new memory block of 8*NumberOfLFs bytes - each line has its own 64-bit Offset
        MhandleLEN = _MEMNEW(8&& * (NumberOfLFs + 1)) 'create new memory block of 8*NumberOfLFs bytes - each line has its own 64-bit Length; kinda overkill, could be 32-bit

Now loading Wikipedia dump...

Loaded successfully in ... a few hours.

Two things need to be updated. Pressing the two Shifts has to be revised (the status line is not wide enough for GB-sized files):

- expand the window from 128 to at least 198 columns (in 1920x1080; see the current line - this will land in the next revisions). I have plans for those 60 extra columns to be used as a side window (with purple/indigo foreground and black background), sidekicking/assisting the left window, both living in one physical window.

In the next days I intend to make a single add-on showing in this window (I am already calling it 'Indigo') the richest wordlist of the English language - the Schizandrafield revision C - 788,068,084 distinct words strong! Crazy is good, isn't it? Thus, when loading some text in Masakari, the user would see whether the word (and in the future, the phrase) under the cursor appears in the corpora listed below; the Indigo "window" will automatically be positioned at the proper line (of course the Schizandrafield will not be loaded into RAM but searched as fixed-length records)...

- the reported seconds were negative; I still don't have an adequate function for returning elapsed seconds when midnight is passed...

Edit, 2021-Jan-29:

Okay, I fixed the above two issues. The load time in fast mode was 14048 s / 3600 = almost 4 hours (when I have enough time I will write a faster parser). A side issue: after the Windows 10 PRO update 20H2 (OS Build 19042.746) my CPU was somehow "downgraded" (CPU-Z and AIDA64 both report a 13x multiplier instead of 31x), from 26 GB/s Memory Read (3.1 GHz max turbo) down to 16 GB/s Memory Read (1.3 GHz, no turbo), so results on other machines should be better. My first guess is that the RAM hack (installing a 32 GB stick, which was said to be impossible) triggered this behavior; updating the BIOS could resolve it. Anyway, my 65536 MB Windows swap file was by chance enough for this Wikipedia load: 36 GB + 64 GB virtual = 99 GB; after loading the 76 GB Wikipedia only 400 MB were free, and after some minutes Windows did automatic compression and freed ~8 GB. So my advice is to have 128 GB of virtual RAM, since Wikipedia steadily grows in size - AFAIR six years ago it was half this size.

 


Code: [Select]
WKE_0,000,001_kaoudi

Three fields constitute the whole line (padded with spaces on the right):
3 bytes for the TAG
2 bytes for the two delimiters - the underscores
9 bytes for the Number of Occurrences
32 bytes for the word (the longest word in the Heritage Dictionary was 31)
So 3+2+9+32 = 46, with even 14 more characters left over.
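A hedged C sketch of how such fixed-length records could be searched without loading the file into RAM (my own illustration; `REC = 46` and the helper names are assumptions). Note the word field starting at byte 14, i.e. column 15, which is exactly what the `sort /+15` command further below keys on:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Fixed-length Schizandrafield-style records, e.g.
   "WKE_0,000,001_kaoudi" space-padded to a constant width:
   3-byte tag, '_', 9-byte comma-grouped count, '_', then the word
   starting at byte 14. Fixed width means record i lives at offset
   i*REC, so a sorted file can be binary-searched in place.
   Hedged sketch: REC = 46 (3+2+9+32) is my assumption. */
enum { WORD_AT = 14, REC = 46 };

/* Return the index of the record whose word field matches, or -1. */
static int64_t find_word(const char *base, int64_t nrec, const char *word) {
    int64_t lo = 0, hi = nrec - 1;
    size_t wl = strlen(word);
    while (lo <= hi) {
        int64_t mid = lo + (hi - lo) / 2;
        const char *w = base + mid * REC + WORD_AT;
        int cmp = strncmp(word, w, wl);
        if (cmp == 0 && w[wl] == ' ') return mid; /* padded => exact hit */
        if (cmp == 0) cmp = -1; /* query is a strict prefix: look lower */
        if (cmp < 0) hi = mid - 1; else lo = mid + 1;
    }
    return -1;
}
```

In the real case, `base + mid * REC` would become a seek-and-read of one record, so the Indigo window could be positioned with ~30 reads even over a multi-GB wordlist.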
The booklet is in .PDF here:

 

Code: [Select]
[Schizandrafield 1-gram Corpus, revision C, derives from next corpora:]

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| Corpus Tag, Name                                                                                                     | Corpus size (in bytes) |     Total Words | Unique Words | Needed memory to rip in a single pass |
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| AHD, American_Heritage_Dictionary_4_(En-En)_WHOLEWORDS.dsl                                                           |             41,742,099 |       7,083,439 |      176,377 |                              16,512KB |
| BNC, Machine-Learning_British-National-Corpus_XML-edition.tar                                                        |          4,680,140,800 |     980,238,337 |      367,921 |                              34,389KB |
| BRE, Britannica_Encyclopedia_2010_1.563_miled_(En-En)_ANSI.dsl                                                       |            297,981,779 |      46,958,317 |      356,417 |                              33,322KB |
| CAL, Cambridge_Advanced_Learner's_Dictionary_4th_Ed_(En-En).dsl                                                      |             76,257,922 |      13,994,351 |       57,787 |                               5,417KB |
| CCA, Collins_COBUILD_Advanced_Learner's_English_Dictionary_(5th_ed)_(En-En)_WHOLEWORDS.dsl                           |             22,960,485 |       4,553,432 |       61,575 |                               5,772KB |
| DMC, DeepMind_Q_and_A_Dataset_cnn_downloads_(92579_files).tar                                                        |          7,270,204,928 |     988,686,300 |      499,244 |                              46,623KB |
| DMD, DeepMind_Q_and_A_Dataset_dailymail_downloads_(219506_files).tar                                                 |         59,019,643,392 |   7,533,039,068 |      886,887 |                              82,582KB |
| EDR, encyclopediadramaticase-20150628-current.tar                                                                    |            302,959,104 |      41,642,066 |      456,251 |                              42,635KB |
| EJD, Encyclopaedia_Judaica_(in_22_volumes)_TXT.tar                                                                   |            107,784,192 |      16,158,364 |      195,273 |                              18,281KB |
| EJN, ENAMDICT_Japanese_names                                                                                         |             26,392,511 |       1,645,705 |      343,460 |                              32,106KB |
| FDU, For_Dummies_978-ebooks_Collection.tar                                                                           |            811,308,544 |     122,351,592 |      302,353 |                              28,282KB |
| GGB, Google_Books_corpus_version_20130501_English_All_Nodes.txt                                                      |         10,624,363,237 |     178,439,407 |    7,477,257 |                             664,635KB |
| HCN, Hacker_News_2006_to_2017-jul.json                                                                               |          7,046,506,518 |   1,075,832,079 |    2,188,058 |                             201,838KB |
| IST, INTERNET_SACRED_TEXT_ARCHIVE_DVD-ROM_9_(English_140479_htm_files).tar                                           |          2,037,880,832 |     304,410,076 |    1,333,036 |                             123,688KB |
| LDC, Longman_Dictionary_of_Contemporary_English_5th_Ed_(En-En)_WHOLEWORDS.dsl                                        |             52,870,741 |       9,722,686 |       85,217 |                               7,987KB |
| MCD, Macmillan_English_Dictionary_(En-En)_.dsl                                                                       |             79,686,074 |      11,750,813 |       67,340 |                               6,312KB |
| MCT, Macmillan_English_Thesaurus_(En-En).dsl                                                                         |             29,580,755 |       4,650,089 |       39,528 |                               3,708KB |
| NSO, New_Shorter_Oxford_English_Dictionary_fifth_edition.tar                                                         |            132,728,832 |      25,920,769 |      259,990 |                              24,321KB |
| OED, Oxford_English_Dictionary_2nd_Edition_Version_4_(En-En)_WHOLEWORDS.dsl.txt                                      |            564,235,251 |     101,798,550 |    1,089,240 |                             101,214KB |
| OSH, OSHO.TXT                                                                                                        |            206,908,949 |      31,957,006 |       58,893 |                               5,522KB |
| PGT, Project_Gutenberg_DVD-2010_(29180_files).tar                                                                    |         11,110,769,152 |   1,870,216,915 |    3,847,963 |                             350,566KB |
| RDD, Reddit_Comments_(JSON_objects)_from_(2005-12_to_2018-01).json                                                   |      2,277,975,364,152 | 333,829,940,270 |  689,949,388 |                          54,703,959KB |
| RHW, Random_House_Webster's_Unabridged_Dictionary_(En-En)_.dsl                                                       |             53,483,152 |       9,367,457 |      282,580 |                              26,428KB |
| SNT, Machine-Learning_WestburyLab.NonRedundant.UsenetCorpus_(47860_English_language_non-binary-file_news_groups).tar |         39,513,013,248 |   6,316,689,948 |    4,835,188 |                             437,662KB |
| STX, archive.org_stackexchange_(346_corpora_2017-Oct-12).tar                                                         |        274,935,801,856 |  38,077,068,727 |   29,194,792 |                           2,344,214KB |
| TAL, the-anarchist-library-2016-01-18-en_html.tar                                                                    |            153,703,936 |      24,339,935 |      136,000 |                              12,738KB |
| TXF, TEXTFILES.COM_(58096_files).tar                                                                                 |          1,382,122,496 |     192,893,874 |    1,008,780 |                              93,840KB |
| URB, Machine-Learning_Urban_Dictionary_Definitions_Corpus_(1999_-_May-2016).words.json                               |          1,917,822,288 |     263,253,093 |    2,631,962 |                             241,852KB |
| WKD, dumps.wikimedia.org_Germany_dewiki-20180220-pages-articles.xml                                                  |         18,954,897,343 |   2,362,729,484 |   17,415,343 |                           1,467,593KB |
| WKE, dumps.wikimedia.org_English_enwiki-20180220-pages-articles.xml                                                  |         65,865,333,874 |   8,739,196,084 |   39,440,894 |                           3,071,920KB |
| WKF, dumps.wikimedia.org_France_frwiki-20180220-pages-articles.xml                                                   |         17,802,386,071 |   2,429,769,009 |   12,192,025 |                           1,055,599KB |
| WKI, dumps.wikimedia.org_Italy_itwiki-20180220-pages-articles.xml                                                    |         10,887,321,918 |   1,372,430,005 |    8,960,466 |                             790,544KB |
| WKN, dumps.wikimedia.org_Netherlands_nlwiki-20180220-pages-articles.xml                                              |          6,808,875,477 |     800,283,886 |    8,596,100 |                             760,225KB |
| WKP, dumps.wikimedia.org_Portugal_ptwiki-20180220-pages-articles.xml                                                 |          6,891,588,341 |     940,571,722 |    6,736,349 |                             602,633KB |
| WKS, dumps.wikimedia.org_Spain_eswiki-20180220-pages-articles.xml                                                    |         12,200,295,384 |   1,682,091,803 |    9,780,910 |                             858,723KB |
| WMB, dumps.wikimedia.org_English_enwikibooks-20180220-pages-articles.xml                                             |            641,413,774 |      94,801,223 |      971,592 |                              90,438KB |
| WMN, dumps.wikimedia.org_English_enwikinews-20180220-pages-articles.xml                                              |            201,872,863 |      27,328,349 |      404,890 |                              37,839KB |
| WMP, dumps.wikimedia.org_English_specieswiki-20180220-pages-articles.xml                                             |          1,009,303,358 |     107,856,282 |    2,765,079 |                             254,016KB |
| WMQ, dumps.wikimedia.org_English_enwikiquote-20180220-pages-articles.xml                                             |            410,396,147 |      64,809,361 |      561,894 |                              52,455KB |
| WMS, dumps.wikimedia.org_English_enwikisource-20180220-pages-articles.xml                                            |          8,352,677,820 |   1,282,920,260 |    8,547,509 |                             755,716KB |
| WMU, dumps.wikimedia.org_English_enwikiversity-20180220-pages-articles.xml                                           |            369,987,893 |      52,322,592 |      640,606 |                              59,767KB |
| WMV, dumps.wikimedia.org_English_enwikivoyage-20180220-pages-articles.xml                                            |            354,919,743 |      49,701,293 |      734,403 |                              68,474KB |
| WMW, dumps.wikimedia.org_English_enwiktionary-20180220-pages-articles.xml                                            |          5,379,842,566 |     597,315,425 |   14,743,543 |                           1,259,179KB |
| WUD, Webster's_Unabridged_3_(En-En)_WHOLEWORDS_ANSI.dsl                                                              |            134,706,719 |      24,014,478 |      364,352 |                              34,052KB |
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

[Schizandrafield 1-gram Corpus, revision C, holds within next tagged-counted-wordlists:]

   237,859,509 dumps.wikimedia.org_Spain_eswiki-20180220-pages-articles.1gram
   161,745,186 dumps.wikimedia.org_Portugal_ptwiki-20180220-pages-articles.1gram
   211,751,899 dumps.wikimedia.org_Netherlands_nlwiki-20180220-pages-articles.1gram
   216,667,608 dumps.wikimedia.org_Italy_itwiki-20180220-pages-articles.1gram
   969,819,544 dumps.wikimedia.org_English_enwiki-20180220-pages-articles.1gram
   452,471,230 dumps.wikimedia.org_Germany_dewiki-20180220-pages-articles.1gram
   294,822,021 dumps.wikimedia.org_France_frwiki-20180220-pages-articles.1gram
    23,100,108 dumps.wikimedia.org_English_enwikibooks-20180220-pages-articles.1gram
   204,113,124 dumps.wikimedia.org_English_enwikisource-20180220-pages-articles.1gram
     9,347,743 dumps.wikimedia.org_English_enwikinews-20180220-pages-articles.1gram
    13,259,587 dumps.wikimedia.org_English_enwikiquote-20180220-pages-articles.1gram
    15,131,207 dumps.wikimedia.org_English_enwikiversity-20180220-pages-articles.1gram
    18,031,161 dumps.wikimedia.org_English_enwikivoyage-20180220-pages-articles.1gram
   355,881,851 dumps.wikimedia.org_English_enwiktionary-20180220-pages-articles.1gram
    65,846,403 dumps.wikimedia.org_English_specieswiki-20180220-pages-articles.1gram
     4,653,581 Encyclopaedia_Judaica_(in_22_volumes)_TXT.1gram
     8,889,960 Webster's_Unabridged_3_(En-En)_WHOLEWORDS_ANSI.dsl.1gram
     3,288,498 the-anarchist-library-2016-01-18-en_html.tar.1gram
     6,973,687 Random_House_Webster's_Unabridged_Dictionary_(En-En)_.dsl.1gram
    26,501,920 Oxford_English_Dictionary_2nd_Edition_Version_4_(En-En)_WHOLEWORDS.dsl.txt.1gram
     1,404,700 OSHO.TXT.1gram
     6,280,919 New_Shorter_Oxford_English_Dictionary_fifth_edition.tar.1gram
       941,039 Macmillan_English_Thesaurus_(En-En).dsl.1gram
     1,609,423 Macmillan_English_Dictionary_(En-En)_.dsl.1gram
     2,044,933 Longman_Dictionary_of_Contemporary_English_5th_Ed_(En-En)_WHOLEWORDS.dsl.1gram
     7,354,174 For_Dummies_978-ebooks_Collection.tar.1gram
    11,464,911 encyclopediadramaticase-20150628-current.tar.1gram
     8,840,362 ENAMDICT_Japanese_names.1gram
     1,466,506 Collins_COBUILD_Advanced_Learner's_English_Dictionary_(5th_ed)_(En-En)_WHOLEWORDS.dsl.1gram
     1,380,759 Cambridge_Advanced_Learner's_Dictionary_4th_Ed_(En-En).dsl.1gram
     8,621,181 Britannica_Encyclopedia_2010_1.563_miled_(En-En)_ANSI.dsl.1gram
     4,268,314 American_Heritage_Dictionary_4_(En-En)_WHOLEWORDS.dsl.1gram
   186,006,261 Google_Books_corpus_version_20130501_English_All_Nodes.1gram
    32,262,044 INTERNET_SACRED_TEXT_ARCHIVE_DVD-ROM_9_(English_140479_htm_files).1gram
    96,997,413 Project_Gutenberg_DVD-2010_(29180_files).1gram
    24,179,167 TEXTFILES.COM_(58096_files).1gram
   797,764,729 archive.org_stackexchange_(346_corpora_2017-Oct-12).1gram
     8,823,193 Machine-Learning_British-National-Corpus_XML-edition.1gram
    65,616,610 Machine-Learning_Urban_Dictionary_Definitions_Corpus_(1999_-_May-2016).words.1gram
   120,939,472 Machine-Learning_WestburyLab.NonRedundant.UsenetCorpus_(47860_English_language_non-binary-file_news_groups).1gram
    12,375,594 DeepMind_Q_and_A_Dataset_cnn_downloads_(92579_files).1gram
    20,878,468 DeepMind_Q_and_A_Dataset_dailymail_downloads_(219506_files).1gram
16,261,229,827 reddit.1gram
    54,744,506 Hacker_News_2006_to_2017-jul.json.1gram

[The way they were ripped:]

E:\Schizandrafield_workshop>dir dumps.wikimedia.org_English_enwiki-20180220-pages-articles.xml/b  >dumps.wikimedia.org_English_enwiki-20180220-pages-articles.lst
E:\Schizandrafield_workshop>echo WKE>Leprechaun.tag
E:\Schizandrafield_workshop>Leprechaun_x-leton_32bit_Intel_01_008p.exe dumps.wikimedia.org_English_enwiki-20180220-pages-articles.lst dumps.wikimedia.org_English_enwiki-20180220-pages-articles.1gram 1399888 Y
...
E:\Schizandrafield_workshop>type dumps.wikimedia.org_English_enwiki-20180220-pages-articles.1gram|more
WKE_0,000,003_byjnf
WKE_0,000,001_richardmullaney
WKE_0,000,001_vfycdm
WKE_0,000,001_kaoudi
WKE_0,000,001_bzolqoptifgvgptya
WKE_0,000,016_bristolcombination
WKE_0,000,001_hermanaustinamstate
WKE_0,000,004_meweyqhwn
WKE_0,000,002_habsfrontcricket
WKE_0,000,001_hobergsclub
WKE_0,000,001_kwagedle
WKE_0,000,001_buvvhel
WKE_0,000,001_wnjyat
WKE_0,000,001_dilmenpeds
WKE_0,000,001_ovldtxrjaaqiy
...
E:\Schizandrafield_workshop>

[The way they were concatenated/sorted:]

C:\>copy/b *.1gram unsorted
C:\>sort.exe /+15 /M 1048576 /T d: "unsorted" /O "Schizandrafield_Corpus_revision_C_(44-corpora_-unique-words).sorted"
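For the Linux side, a rough GNU coreutils equivalent of the two Windows commands above might look like this (a sketch; the filenames and temp directory are placeholders):

```shell
# Rough GNU equivalent of the Windows concatenate + sort step.
# LC_ALL=C gives plain byte-wise ordering; -k1.15 starts the sort
# key at character 15 of the line, i.e. the word field, mirroring
# Windows `sort /+15`.
cat *.1gram > unsorted
LC_ALL=C sort -k1.15 -T /tmp -o Schizandrafield.sorted unsorted
```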

Thanks to my brother, here come the video and the Linux trio (m, yoshi, masakari - just place them in a folder that is on your path):

« Last Edit: January 29, 2021, 08:30:34 AM by Sanmayce »

Offline Sanmayce

  • Newbie
  • Posts: 18
  • Where is that English Text Sidekick?
    • Sanmayce's home
Re: A skeleton code for Text Scroller via Drag-and-Drop
« Reply #33 on: February 01, 2021, 12:46:21 PM »
Sanmayce's Haikuoid, inspired by "the first kill" scene (Brian Geraghty's mournfulness) from the movie "The Hurt Locker":
Gunlight
Daylight
Sunset
Loss


My word is for Zennish sensitivity... to be activated, and for the artistry within to reveal, by itself, nifty practical etudes.

In r.6 were added:

- Four different sub-variants: r6_Fast_Wrapper, r6_Slow_Wrapper, r6_Fast_Vanilla, r6_Slow_Vanilla:
-- Fast/Slow mode - decided by the variable 'ToLoadOrNotFlag', 1/0 respectively
-- Wrapper/Vanilla mode - decided by the variable 'WrapFlag', 1/0 respectively

- When loading, there is a progress bar in the 'Status Line'
- The 'Status Line' now reports the Current Line Number (I was reluctant to add it, since it conflicted with my plan for streaming, meh)

- The title now contains the actual size of the text window, plus the filename of the loaded file
- The command prompt now offers "help", i.e. all the keyboard and mouse combos.

- In order to browse via mouse only, I have come up with two quite ergonomic mouse combos (the left hand hitting 'm', 'Enter' and 'Space', while the right stays on the mouse non-stop), mimicking 'LCtrl+Home' and 'LCtrl+End':
-- Button 2 + Wheel Up - going to the top left position
-- Button 2 + Wheel Dn - going to the bottom left position

In r.6 were fixed:

- Now a Left Mouse click cannot be placed past the last line (when the loaded lines are fewer than the window height, i.e. YdimROW)

This screenshot shows the Oxford English Dictionary loaded by r6_Fast_Wrapper_64bit on my crippled laptop with an i5-7200U at 1.3 GHz and (32+4) GB:

 


This screenshot shows the English Wikipedia XML dump from 2021-Jan-01 loaded by r6_Fast_Vanilla_64bit on a laptop with a Ryzen 7 4800H at 4300 MHz and 2x32 GB 3200 MHz:

 


Also, a batch file compiling the four sub-variants is included:

Code: [Select]
E:\_KAZE_Smxrt_Benchmarks\QB64_kit_v1.4_2.48 GB\qb64>_Make_EXEs.bat

E:\_KAZE_Smxrt_Benchmarks\QB64_kit_v1.4_2.48 GB\qb64>qb64 -c MASAKARI_r6_Fast_Vanilla.BAS

E:\_KAZE_Smxrt_Benchmarks\QB64_kit_v1.4_2.48 GB\qb64>qb64 -c MASAKARI_r6_Fast_Wrapper.BAS

E:\_KAZE_Smxrt_Benchmarks\QB64_kit_v1.4_2.48 GB\qb64>qb64 -c MASAKARI_r6_Slow_Vanilla.BAS

E:\_KAZE_Smxrt_Benchmarks\QB64_kit_v1.4_2.48 GB\qb64>qb64 -c MASAKARI_r6_Slow_Wrapper.BAS

E:\_KAZE_Smxrt_Benchmarks\QB64_kit_v1.4_2.48 GB\qb64>fc MASAKARI_r6_Fast_Vanilla.BAS MASAKARI_r6_Fast_Wrapper.BAS
Comparing files MASAKARI_r6_Fast_Vanilla.BAS and MASAKARI_R6_FAST_WRAPPER.BAS
***** MASAKARI_r6_Fast_Vanilla.BAS

WrapFlag = 0 ' 1 means wrapping
ToLoadOrNotFlag = 1 ' 1 means fast load but memory greedy; 0 means slow load but memory efficient
***** MASAKARI_R6_FAST_WRAPPER.BAS

WrapFlag = 1 ' 1 means wrapping
ToLoadOrNotFlag = 1 ' 1 means fast load but memory greedy; 0 means slow load but memory efficient
*****


E:\_KAZE_Smxrt_Benchmarks\QB64_kit_v1.4_2.48 GB\qb64>fc MASAKARI_r6_Fast_Vanilla.BAS MASAKARI_r6_Slow_Vanilla.BAS
Comparing files MASAKARI_r6_Fast_Vanilla.BAS and MASAKARI_R6_SLOW_VANILLA.BAS
***** MASAKARI_r6_Fast_Vanilla.BAS
WrapFlag = 0 ' 1 means wrapping
ToLoadOrNotFlag = 1 ' 1 means fast load but memory greedy; 0 means slow load but memory efficient

***** MASAKARI_R6_SLOW_VANILLA.BAS
WrapFlag = 0 ' 1 means wrapping
ToLoadOrNotFlag = 0 ' 1 means fast load but memory greedy; 0 means slow load but memory efficient

*****


E:\_KAZE_Smxrt_Benchmarks\QB64_kit_v1.4_2.48 GB\qb64>fc MASAKARI_r6_Fast_Vanilla.BAS MASAKARI_r6_Slow_Wrapper.BAS
Comparing files MASAKARI_r6_Fast_Vanilla.BAS and MASAKARI_R6_SLOW_WRAPPER.BAS
***** MASAKARI_r6_Fast_Vanilla.BAS

WrapFlag = 0 ' 1 means wrapping
ToLoadOrNotFlag = 1 ' 1 means fast load but memory greedy; 0 means slow load but memory efficient

***** MASAKARI_R6_SLOW_WRAPPER.BAS

WrapFlag = 1 ' 1 means wrapping
ToLoadOrNotFlag = 0 ' 1 means fast load but memory greedy; 0 means slow load but memory efficient

*****

In the attached .ZIP package, there is a README.TXT:

Code: [Select]
README.TXT

A quick .DIZ for Masakari, revision 6

Masakari is a free and open-source tool with a simplistic GUI, aimed at sidekicking the browsing of text files in the Linux/Windows command prompts.

This .ZIP package contains:

```
02/01/2021  03:16 PM             4,446 README.TXT                                  ! This file

01/29/2021  03:05 PM               171 m                                           ! Linux .sh file invokes yoshi and masakari
01/29/2021  03:05 PM           336,644 yoshi                                       ! Linux .ELF 64bit
02/01/2021  06:43 PM           331,852 masakari                                    ! Linux .ELF 64bit same as 'MASAKARI_r6_Fast_Vanilla_64bit'

02/01/2021  03:16 PM         1,453,926 ColumnChart.ico                             ! Needed during compilation
02/01/2021  03:16 PM             2,991 MEM.H                                       ! Needed during compilation

02/01/2021  03:16 PM               492 m.bat                                       ! Accepts wildcards, invokes Yoshi.exe and MASAKARI_r6_Fast_Vanilla_64bit.exe
02/01/2021  03:16 PM            52,633 MASAKARI_r6_Fast_Vanilla.BAS
02/01/2021  03:16 PM         3,288,576 MASAKARI_r6_Fast_Vanilla_32bit.exe
02/01/2021  03:16 PM         3,647,488 MASAKARI_r6_Fast_Vanilla_64bit.exe
02/01/2021  06:43 PM           331,852 MASAKARI_r6_Fast_Vanilla_64bit.elf
02/01/2021  03:16 PM            52,633 MASAKARI_r6_Fast_Wrapper.BAS
02/01/2021  03:16 PM         3,288,576 MASAKARI_r6_Fast_Wrapper_32bit.exe
02/01/2021  03:16 PM         3,647,488 MASAKARI_r6_Fast_Wrapper_64bit.exe
02/01/2021  06:43 PM           331,852 MASAKARI_r6_Fast_Wrapper_64bit.elf
02/01/2021  03:16 PM            52,633 MASAKARI_r6_Slow_Vanilla.BAS
02/01/2021  03:16 PM         3,288,576 MASAKARI_r6_Slow_Vanilla_32bit.exe
02/01/2021  03:16 PM         3,647,488 MASAKARI_r6_Slow_Vanilla_64bit.exe
02/01/2021  06:44 PM           331,852 MASAKARI_r6_Slow_Vanilla_64bit.elf
02/01/2021  03:16 PM            52,633 MASAKARI_r6_Slow_Wrapper.BAS
02/01/2021  03:16 PM         3,288,576 MASAKARI_r6_Slow_Wrapper_32bit.exe
02/01/2021  03:16 PM         3,647,488 MASAKARI_r6_Slow_Wrapper_64bit.exe
02/01/2021  06:44 PM           331,860 MASAKARI_r6_Slow_Wrapper_64bit.elf

02/01/2021  03:16 PM           660,672 M_r6_OED.png                                ! Screenshot of Oxford English Dictionary loaded by Fast_Wrapper
02/01/2021  03:16 PM           416,939 warwalk_180hue-39sat_x3_NearestNeighbor.gif ! The logo

02/01/2021  03:16 PM               240 z.bat
02/01/2021  03:16 PM               340 _Make_EXEs.bat

02/01/2021  03:16 PM            41,814 Yoshi.exe                                   ! tool for generating 'dir/b' and 'ls' like output
02/01/2021  03:16 PM           972,575 Yoshi7-.zip

02/01/2021  03:16 PM                39 _MAKE_HEX_dump.bat                          ! give it a filename to generate HEX output in text format
02/01/2021  03:16 PM           111,328 DUMP_HEX_header.c
02/01/2021  03:16 PM            77,312 DUMP_HEX_header.exe
02/01/2021  03:16 PM            69,120 LineWordreporter.exe
02/01/2021  03:16 PM            52,761 LineWordreporter_r1stats.zip
```

New releases and announcements at https://twitter.com/Sanmayce

This is how it looks in the prompt:

```
E:\qb64>MASAKARI_r6_Fast_Vanilla.exe -h
Masakari, revision 6_Fast_Vanilla, written in QB64 by Kaze, source code downloadable at https://www.qb64.org/forum
Usage: Masakari filename|/help
Currently are implemented only:
Mouse:
      Button 1 - sets the cursor and the inverse line to the chosen position
      Button 3 - PgDn
      Wheel Up - Up
      Wheel Dn - Dn
      Button 2 + Wheel Up - going to the top left position
      Button 2 + Wheel Dn - going to the bottom left position
Keyboard:
       Up
       Down
       Left - still no sideways scroll
       Right - still no sideways scroll
       LCtrl+Home - going to the top left position
       LCtrl+End  - going to the bottom left position
       Alt+X or Alt+Q - quit to the system, without demanding keypress.
       Space - loads the highlighted line (if it is an actual filename)
Benchmarking:
       LShift+RShift - Reporting (in the status line in red color) the time for load
       LCtrl+RCtrl - Reporting (in the status line in red color) the time for PgDn-ing the entire file

E:\qb64>
```

Enfun!

Kaze,
2021-Jan-31

In the upcoming revision 7 there will be an additional parser (like the current 'Slow' sub-variant) NOT loading the entire file; only the 16-byte (Offset+Size) descriptors will be loaded into RAM ... in a buffered way. I fully expect those nasty 4672 seconds to shrink to a few minutes.
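The 16-byte-per-line descriptor scheme can be sketched in C roughly like this (the struct and function names are my own, not from the Masakari source):

```c
#include <stdint.h>

/* Hypothetical sketch of the per-line index: one 16-byte record per line,
   a 64-bit offset plus a 64-bit length, so the RAM cost of the index is
   16 bytes per line regardless of how big the file itself is. */
typedef struct {
    uint64_t offset; /* absolute byte offset of the line's first character */
    uint64_t length; /* line length in characters, CR/LF excluded */
} LineDesc;

/* RAM (in bytes) needed to index a file containing `lines` lines. */
uint64_t index_footprint(uint64_t lines) {
    return lines * (uint64_t)sizeof(LineDesc);
}
```

For the enwiki dump's 1,218,205,075 lines this yields 19,491,281,200 bytes, about 18 GiB, matching the figure quoted above.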
« Last Edit: February 01, 2021, 12:48:05 PM by Sanmayce »
He learns not to learn and reverts to what all men pass by.

Offline Sanmayce

  • Newbie
  • Posts: 18
  • Where is that English Text Sidekick?
    • Sanmayce's home
Re: A skeleton code for Text Scroller via Drag-and-Drop
« Reply #34 on: February 03, 2021, 04:58:44 AM »
Did write the new parser.
Will upload Masakari revision 7 tomorrow; busy with other things, and I want to add the old 'LCtrl+S' combo for spell-checking the current window...

Still, I want to share the first 64-bit parser known to me that handles CRLF or LF line endings (or both, since Linux and Windows text files can end up inside the same .tar file), 87 lines in length:

Code: QB64: [Select]
  1.                 ReadBytes = 0 ' Have to load the 2+GB "malloc" in chunks...
  2.                 chunk128KB$ = SPACE$(128 * 1024 * 1024)
  3.                 DO WHILE ReadBytes + (128 * 1024 * 1024) < FileSize
  4.                     GET #1, , chunk128KB$
  5.                     'IF INSTR(1, chunk128KB$, CHR$(0)) THEN PRINT "Termination! 'Null' character encountered.": _DISPLAY: END
  6.                     I = 1
  7.                     DO WHILE I <= (128 * 1024 * 1024)
  8.                         I = INSTR(I, chunk128KB$, CHR$(10))
  9.                         IF I = 0 THEN EXIT DO
  10.                         NumberOfLFs = NumberOfLFs + 1
  11.                         I = I + 1
  12.                     LOOP
  13.                     ReadBytes = ReadBytes + (128 * 1024 * 1024)
  14.                 LOOP
  15.                 IF (FileSize - ReadBytes) THEN
  16.                     RemainingChunk$ = SPACE$(FileSize - ReadBytes)
  17.                     GET #1, , RemainingChunk$
  18.                     'IF INSTR(1, RemainingChunk$, CHR$(0)) THEN PRINT "Termination! 'Null' character encountered.": _DISPLAY: END
  19.                     I = 1
  20.                     DO WHILE I <= LEN(RemainingChunk$)
  21.                         I = INSTR(I, RemainingChunk$, CHR$(10))
  22.                         IF I = 0 THEN EXIT DO
  23.                         NumberOfLFs = NumberOfLFs + 1
  24.                         I = I + 1
  25.                     LOOP
  26.                 END IF
  27.  
  28.             ' Parser for r.7 [[[[[[
  29.             IF ToLoadOrNotFlag = 0 THEN
  30.                 ReadBytes = 0 ' Have to load the 2+GB "malloc" in chunks...
  31.                 ChunkLen = 128 * 1024
  32.                 chunk128KB$ = SPACE$(ChunkLen) '+1 for sentinel
  33.                 j = 1 ' is the current offset where new GET reads
  34.                 QWORDlast = 1
  35.                 PrevByte = ""
  36.                 DO WHILE ReadBytes + (ChunkLen) < FileSize ' this '<' is important, on purpose not using '<=' since there should be a remnant chunk (where LF postfixing is enforced, eventually)
  37.                     LastByte = RIGHT$(chunk128KB$, 1) 'to handle eventual CR, left behind i.e. in previous chunk
  38.                     GET #1, j, chunk128KB$
  39.                     I = 1
  40.                     FoundAt = INSTR(I, chunk128KB$, CHR$(10))
  41.                     IF FoundAt THEN
  42.                         DO WHILE FoundAt
  43.                             IF FoundAt = 1 THEN PrevByte = LastByte ELSE PrevByte = MID$(chunk128KB$, FoundAt - 1, 1)
  44.                             QWORD = (j - 1) + FoundAt
  45.                             LineLen13 = QWORD - QWORDlast
  46.                             _MEMPUT MhandleOFF, MhandleOFF.OFFSET + 8&& * filecount, QWORDlast
  47.                             QWORDlast = QWORD + 1
  48.                             IF PrevByte = CHR$(13) THEN LineLen13 = LineLen13 - 1
  49.                             _MEMPUT MhandleLEN, MhandleLEN.OFFSET + 8&& * filecount, LineLen13
  50.                             IF LineLen13 > LongestLine THEN LongestLine = LineLen13
  51.                             filecount = filecount + 1
  52.                             I = FoundAt + 1
  53.                             IF I > (ChunkLen) THEN EXIT DO 'could use sentinel (buffer+1), in order this line to drop out
  54.                             FoundAt = INSTR(I, chunk128KB$, CHR$(10))
  55.                         LOOP
  56.                     END IF
  57.                     j = j + (ChunkLen)
  58.                     ReadBytes = ReadBytes + (ChunkLen)
  59.                 LOOP
  60.                 IF (FileSize - ReadBytes) THEN
  61.                     RemainingChunk$ = SPACE$(FileSize - ReadBytes) '+1 for sentinel
  62.                     LastByte = RIGHT$(chunk128KB$, 1) 'to handle eventual CR, left behind i.e. in previous chunk
  63.                     GET #1, , RemainingChunk$
  64.                     IF RIGHT$(RemainingChunk$, 1) <> CHR$(10) THEN RemainingChunk$ = RemainingChunk$ + CHR$(10) ' dirty, enforcing not missing the last line (if it is not postfixed with LF)
  65.                     'Beware, yes be aware that above line should have been applied for above/first fragment because the filesize could be multiple of the chunk length i.e. no remaining chunk, however it was feinted by '<'
  66.                     I = 1
  67.                     FoundAt = INSTR(I, RemainingChunk$, CHR$(10))
  68.                     IF FoundAt THEN
  69.                         DO WHILE FoundAt
  70.                             IF FoundAt = 1 THEN PrevByte = LastByte ELSE PrevByte = MID$(RemainingChunk$, FoundAt - 1, 1)
  71.                             QWORD = (j - 1) + FoundAt
  72.                             LineLen13 = QWORD - QWORDlast
  73.                             _MEMPUT MhandleOFF, MhandleOFF.OFFSET + 8&& * filecount, QWORDlast
  74.                             QWORDlast = QWORD + 1
  75.                             IF PrevByte = CHR$(13) THEN LineLen13 = LineLen13 - 1
  76.                             _MEMPUT MhandleLEN, MhandleLEN.OFFSET + 8&& * filecount, LineLen13
  77.                             IF LineLen13 > LongestLine THEN LongestLine = LineLen13
  78.                             filecount = filecount + 1
  79.                             I = FoundAt + 1
  80.                             IF I > (FileSize - ReadBytes) THEN EXIT DO 'could use sentinel (buffer+1), in order this line to drop out
  81.                             FoundAt = INSTR(I, RemainingChunk$, CHR$(10))
  82.                         LOOP
  83.                     END IF
  84.                 END IF
  85.             END IF
  86.             SEEK #1, 1
  87.             ' Parser for r.7 ]]]]]]
  88.  
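The core idea of the parser above can be restated as a short C sketch (my own naming, not a translation of the Masakari source): for every LF, emit the line's absolute start offset and length, subtracting one when the byte just before the LF is a CR; the byte carried in from the previous chunk covers a CR that straddles a chunk boundary. `emit` is a caller-supplied callback, an assumption of this sketch.

```c
#include <stdint.h>
#include <string.h>

typedef void (*EmitFn)(uint64_t start, uint64_t len, void *ctx);

/* Scan one chunk. `base` is the chunk's absolute file offset, `prev_byte`
   the last byte of the previous chunk, `line_start` the running absolute
   offset of the current (possibly chunk-spanning) line. */
void index_chunk(const char *chunk, size_t n, uint64_t base,
                 char prev_byte, uint64_t *line_start,
                 EmitFn emit, void *ctx)
{
    const char *p = chunk, *end = chunk + n;
    while (p < end) {
        const char *lf = memchr(p, '\n', (size_t)(end - p));
        if (!lf) break;                               /* no more LFs here */
        uint64_t pos = base + (uint64_t)(lf - chunk); /* absolute LF offset */
        uint64_t len = pos - *line_start;
        char before = (lf == chunk) ? prev_byte : lf[-1];
        if (before == '\r') len--;                    /* CRLF: drop the CR */
        emit(*line_start, len, ctx);
        *line_start = pos + 1;                        /* next line past LF */
        p = lf + 1;
    }
}

/* Tiny collector used for illustration only. */
uint64_t got_start[8], got_len[8];
size_t got_n = 0;
void collect(uint64_t s, uint64_t l, void *ctx) {
    (void)ctx;
    got_start[got_n] = s;
    got_len[got_n] = l;
    got_n++;
}
```

Feeding `"ab\r"` then `"\ncd\n"` as two chunks exercises the boundary case: the CR at the end of the first chunk is correctly excluded from the first line's length.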

The speed is MUTSI! The Oxford English Dictionary was loaded in 5 seconds on my 1.3GHz i5-7200U; the memory usage is roughly 400MB:

 


Also tried the vanilla (non-wrapped) English Wikipedia: the load time is 1188 seconds, and the memory usage is around 19GB:

 


Also, the new revision already handles unwrappable files (such as Wikipedia) by dumping all unwrappable lines into filename+".unwrappable" while wrapping all the rest, and it reports in the status line how many lines are unwrappable. Excited to see, in the next days, Wikipedia loaded in readable format for the first time; expecting some 6 billion lines... Edit: actually 1.4 billion, only 200 million more, since the 198-char-wide window accommodated most of them:

The actual wrapped Wikipedia loaded in 375 seconds:

 
« Last Edit: February 05, 2021, 10:31:01 PM by Sanmayce »

Offline Sanmayce

Re: A skeleton code for Text Scroller via Drag-and-Drop
« Reply #35 on: February 05, 2021, 09:55:33 AM »
So glad to share the first operational revision, the one that is the actual fulcrum/skeleton.

Code: [Select]
README.TXT

A quick .DIZ for Masakari, revision 7

Masakari is a free and open source tool with a simplistic GUI, aiming to be a sidekick for browsing text files in Linux/Windows command prompts.

This .ZIP package contains:

```
README.TXT                                  ! This file

catalog_2021-02-05_16-20-22.png             ! Screenshot of how 'masakari.tgz' was created
masakari.tgz:                               ! Contains the trio, with attributes (executable flag) set:
 m                                          ! Linux .sh file invokes yoshi and masakari
 yoshi                                      ! Linux .ELF 64bit
 masakari                                   ! Linux .ELF 64bit same as 'MASAKARI_r7_Vanilla_64bit'

ColumnChart.ico                             ! Needed during compilation
MEM.H                                       ! Needed during compilation

m.bat                                       ! Accepts wildcards, invokes Yoshi.exe and MASAKARI_r7_Vanilla_64bit.exe

MASAKARI_r7_Vanilla.BAS
MASAKARI_r7_Vanilla_32bit.exe
MASAKARI_r7_Vanilla_64bit.elf
MASAKARI_r7_Vanilla_64bit.exe
MASAKARI_r7_Wrapper.BAS
MASAKARI_r7_Wrapper_32bit.exe
MASAKARI_r7_Wrapper_64bit.elf
MASAKARI_r7_Wrapper_64bit.exe

MASAKARI_r7_Vanilla_OED.png                 ! Screenshot of Oxford English Dictionary loaded by Vanilla
MASAKARI_r7_Vanilla_Wikipedia.png           ! Screenshot of English Wikipedia loaded by Vanilla
warwalk_180hue-39sat_x3_NearestNeighbor.gif ! The logo

z.bat
_Make_EXEs.bat

Yoshi.exe                                   ! tool for generating 'dir/b' and 'ls' like output
Yoshi7-.zip

_MAKE_HEX_dump.bat                          ! give it a filename to generate HEX output in text format
DUMP_HEX_header.c
DUMP_HEX_header.exe
LineWordreporter.exe
LineWordreporter_r1stats.zip
```

New releases and announcements at https://twitter.com/Sanmayce

This is how it looks in the prompt:

```
E:\qb64>MASAKARI_r7_Vanilla_32bit.exe -h
Masakari, revision 7_Vanilla, written in QB64 by Kaze, source code downloadable at https://www.qb64.org/forum
Usage: Masakari filename|/help
Currently are implemented only:
Mouse:
      Button 1 - sets the cursor and the inverse line to the chosen position
      Button 3 - PgDn
      Wheel Up - Up
      Wheel Dn - Dn
      Button 2 + Wheel Up - going to the top left position
      Button 2 + Wheel Dn - going to the bottom left position
      Button 2 + dragging (from left to right, or from right to left) for at least 90 columns/cells (within 2 seconds) - same as Alt+X, Alt+Q
      Button 2 + dragging (from top to bottom) for at least 5 lines/cells (within 2 seconds) - same as PgUp
      Button 2 + dragging (from bottom to top) for at least 5 lines/cells (within 2 seconds) - same as PgDn
Keyboard:
       Up
       Down
       PgUp
       PgDn
       Left - still no sideways scroll
       Right - still no sideways scroll
       LCtrl+Home - going to the top left position
       LCtrl+End  - going to the bottom left position
       Alt+X or Alt+Q - quit to the system, without demanding keypress.
       Space - loads the highlighted line (if it is an actual filename)
Benchmarking:
       LShift+RShift - Reporting (in the status line in red color) the time for load
       LCtrl+RCtrl - Reporting (in the status line in red color) the time for PgDn-ing the entire file
Note1: The 'Vanilla' sub-variant loads textual files without wrapping the lines.
Note2: The 'Wrapper' sub-variant makes the text file viewable without side/lateral scroll,
       If unwrappable lines exist then those lines are dumped to filename+".unwrappable",
       otherwise, the wrapped lines are dumped to filename+".wrapped", and auto-loaded.
       If wrapped file exists during start then it is used, not re-created.

E:\qb64>
```

Enfun!

Kaze,
2021-Feb-05

Offline Sanmayce

Re: A skeleton code for Text Scroller via Drag-and-Drop
« Reply #36 on: February 19, 2021, 08:09:58 AM »
So glad to announce that the first Masakari/Kazahana duo is ready; on Monday I will make and upload to YouTube a video clip showing how the whole Wikipedia is traversed from Masakari via Kazahana (the fastest exact/wildcard/fuzzy scalar searcher on the Internet)... The *nix users (along with Windows ones) will be able to download a package from here with precompiled static binaries (.elf and .exe), plus the SOURCE CODE and instructions on how to compile it. Of course, for 64-bit as well.

No time at the moment, but here is how it looks on Windows XP 32-bit:

The drag-and-drop, waiting for a file:

 


The Search Panel:

 


My brother's fast test machine (AMD 4800H, 16 threads, 64GB RAM, M.2 SSD at 2400MB/s) will be used, but even it cannot keep up with the awesomeness of Kazahana, which devours the read data with its 16 threads, thus reaching 3000MB/s search rates; to be seen...

Offline Sanmayce

Re: A skeleton code for Text Scroller via Drag-and-Drop
« Reply #37 on: February 22, 2021, 02:46:11 PM »
For the first time the fastest fulltext search console tool can be run from Masakari.

Step #1:

 


Step #2:

 


The search rate was 1500+MB/s, i.e. 48 seconds for fulltext searching the whole of Wikipedia. Not that good, because the NVMe SSD can do 2400MB/s; for some reason Windows didn't allow it. In the next weeks I will retest the same benchmark on Linux.

I did the video, however I am not satisfied with the result (too many blurred scenes); still, it shows:
- decompressing the .zip package;
- sending the shortcut of Masakari.exe to the Desktop;
- running the shortcut;
- drag-and-drop the Wikipedia;
- searching with Kazahana;
- auto-scrolling with the LAlt+RAlt combo, and pressing the PgUp key in order to advance ... backwards while the scroll is on (line-by-line).



Was curious how one of the best editors (UltraEdit v.28) deals with the same benchmark; here is what I saw:

UltraEdit v.28:
Load Time: 0s (it has streaming approach)
Fulltext Search Time: 17min 10s

Masakari r7+:
Load Time: 309s (it has indexing approach)
Fulltext Search Time: 49s
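As a sanity check on those two search times, the throughput each implies can be computed from the dump size quoted earlier in the thread (81,193,612,108 bytes); a tiny helper, my own naming:

```c
#include <stdint.h>

/* Integer-truncated throughput in MiB/s given total bytes and seconds. */
uint64_t mib_per_s(uint64_t bytes, uint64_t seconds) {
    return bytes / seconds / (1024 * 1024);
}
```

Masakari's 49 s works out to roughly 1580 MiB/s, while UltraEdit's 17 min 10 s (1030 s) is about 75 MiB/s, consistent with the 1500+MB/s figure above.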

Wikipedia can be downloaded at:
https://dumps.wikimedia.org/enwiki/
« Last Edit: February 24, 2021, 08:37:52 AM by Sanmayce »

Offline NOVARSEG

  • Seasoned Forum Regular
  • Posts: 299
Re: A skeleton code for Text Scroller via Drag-and-Drop
« Reply #38 on: February 23, 2021, 12:41:37 AM »
@Sanmayce

 chunk128KB$ = SPACE$(128 * 1024 * 1024)


should be 128MB$  ?

****
Quote
I = 1
                    DO WHILE I <= (128 * 1024 * 1024)
                        I = INSTR(I, chunk128KB$, CHR$(10))
                        IF I = 0 THEN EXIT DO
                        NumberOfLFs = NumberOfLFs + 1
                        I = I + 1
                    LOOP

I think that INSTR returns the position of CHR$(10) in the string

from wiki
Quote
The function returns the position% in the baseString$ where the searchString$ was found.
so

I = 0
                        DO
                        I = INSTR(1 + I, chunk128KB$, CHR$(10))
                        IF I = 0 THEN EXIT DO
                        NumberOfLFs = NumberOfLFs + 1
                       
                    LOOP
When the code exits the LOOP, there are likely some bytes left over at the end of the string that might be data for the next line.

not tested
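One common way to handle that leftover concern, sketched here in C under my own naming (not code from this thread): count the LFs in each chunk and also report how many trailing bytes after the last LF belong to a line that continues in the next chunk, so the caller can carry them over.

```c
#include <stddef.h>

/* Count '\n' bytes in `chunk` and report via `tail_len` how many trailing
   bytes (after the last LF) belong to a line continuing in the next chunk.
   If the chunk contains no LF at all, the whole chunk is tail. */
size_t count_lf_with_tail(const char *chunk, size_t n, size_t *tail_len) {
    size_t count = 0, last_lf = 0;
    int seen = 0;
    for (size_t i = 0; i < n; i++) {
        if (chunk[i] == '\n') { count++; last_lf = i; seen = 1; }
    }
    *tail_len = seen ? n - last_lf - 1 : n;
    return count;
}
```

The caller would prepend those `tail_len` bytes to the next chunk before scanning it, so no line is ever split by a chunk boundary.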

« Last Edit: February 23, 2021, 01:15:46 AM by NOVARSEG »

Offline SMcNeill

  • QB64 Developer
  • Forum Resident
  • Posts: 3414
    • Steve’s QB64 Archive Forum
Re: A skeleton code for Text Scroller via Drag-and-Drop
« Reply #39 on: February 23, 2021, 02:32:28 AM »
Quote from: NOVARSEG on February 23, 2021, 12:41:37 AM
 chunk128KB$ = SPACE$(128 * 1024 * 1024)

should be 128MB$  ?

I think that INSTR returns the position of CHR$(10) in the string
[...]
When the code exits the LOOP, there are likely some bytes left over at the end of the string that might be data for the next line.

CHR$(10) is the EOL character (End of Line).  Anything past it goes on the next line.
https://github.com/SteveMcNeill/Steve64 — A github collection of all things Steve!

Offline NOVARSEG

Re: A skeleton code for Text Scroller via Drag-and-Drop
« Reply #40 on: February 23, 2021, 03:38:14 AM »
A general CHUNK processor

Code: QB64: [Select]
  1. ' A general CHUNK processor: read the file in 128MB chunks,
  2. ' the last chunk being the remainder. Have to load the 2+GB "malloc" in chunks...
  3. DIM Chunksize AS _INTEGER64 ' _INTEGER64, since sizes can exceed 32 bits
  4. DIM ReadBytes AS _INTEGER64
  5. DIM Filesize AS _INTEGER64
  6. DIM String1 AS STRING
  7.  
  8. Chunksize = 128 * 1024 * 1024
  9. Filesize = LOF(1)
  10.  
  11. DO
  12.     IF Filesize = 0 THEN EXIT DO
  13.     IF Filesize <= Chunksize THEN
  14.         ReadBytes = Filesize
  15.         Filesize = 0
  16.     ELSE
  17.         ReadBytes = Chunksize
  18.         Filesize = Filesize - Chunksize
  19.     END IF
  20.     GOSUB process
  21. LOOP
  22. END
  23.  
  24. process:
  25. String1 = SPACE$(ReadBytes)
  26. GET #1, , String1
  27. ' ETC
  28. RETURN
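The chunk-size arithmetic above, restated as a tiny C helper (my naming, an illustration rather than anything from the thread's code): each pass consumes min(Chunksize, remaining) bytes until the file is exhausted.

```c
#include <stdint.h>

/* Returns the number of bytes to read this pass and decrements the
   remaining-bytes counter; returns 0 once the file is exhausted. */
uint64_t next_chunk(uint64_t *remaining, uint64_t chunk_size) {
    uint64_t n = (*remaining < chunk_size) ? *remaining : chunk_size;
    *remaining -= n;
    return n;
}
```

For a 300-byte file and 128-byte chunks the passes read 128, 128, 44, then 0 bytes, mirroring the two IF branches in the BASIC code.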
« Last Edit: February 23, 2021, 04:34:10 AM by NOVARSEG »

Offline Sanmayce

Re: A skeleton code for Text Scroller via Drag-and-Drop
« Reply #41 on: February 23, 2021, 09:32:46 AM »
@NOVARSEG

Thanks, you are right, but the possible line left behind unhandled is "COVERED" by the extra +1 in the "malloc" lines, which come right after the loops counting the LFs:

Code: QB64: [Select]
  1.             MhandleOFF = _MEMNEW(8&& * (NumberOfLFs + 1)) 'create new memory block of 8*NumberOfLFs bytes - each line has its own 64bit Offset
  2.             MhandleLEN = _MEMNEW(8&& * (NumberOfLFs + 1)) 'create new memory block of 8*NumberOfLFs bytes - each line has its own 64bit Length, kinda overkill, could be 32bit
  3.  

The idea is simply to know the number of LFs in the incoming file; if the file doesn't end in LF, the above +1 ensures we don't miss the last line.
The best way to count is neither INSTR() nor memchr(); it is a dedicated, vectorized function counting all the LFs in the vector (unlike vectorized memchr(), which stops at the first hit). For best speed it has to be multi-threaded; I found that 4 threads are enough to saturate the main RAM's read bandwidth. Imagine a scenario where you have 24GB + 77GB, or simply a machine with 128GB RAM: then the whole Wikipedia will be cached, and search speed goes up 2x or 3x as a minimum. I tested a 13GB text corpus of 400+ million lines (subtitles from movies), and Kazahana reported 4+GB/s:
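A portable C sketch of such a count-all-LFs function (my own SWAR version, an assumption, not Kazahana's actual kernel): it examines eight bytes per step and, unlike memchr(), never stops at the first hit.

```c
#include <stdint.h>
#include <string.h>

/* Count every '\n' in buf, 8 bytes at a time (SWAR), then finish the tail
   byte-by-byte. The zero-lane test is carry-free per byte, so adjacent
   bytes cannot produce false positives. */
uint64_t count_lf(const unsigned char *buf, size_t n) {
    const uint64_t lfs  = 0x0A0A0A0A0A0A0A0AULL; /* '\n' in every lane */
    const uint64_t low7 = 0x7F7F7F7F7F7F7F7FULL;
    uint64_t count = 0;
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        uint64_t w;
        memcpy(&w, buf + i, 8);                  /* alignment-safe load */
        uint64_t x = w ^ lfs;                    /* LF lanes become 0x00 */
        uint64_t t = ((x & low7) + low7) | x;    /* high bit clear iff lane==0 */
        uint64_t z = ~t & ~low7;                 /* 0x80 in every zero lane */
        while (z) { z &= z - 1; count++; }       /* popcount of flagged lanes */
    }
    for (; i < n; i++)
        if (buf[i] == '\n') count++;
    return count;
}
```

A real high-throughput version would widen this to SSE/AVX registers and split the buffer across threads, as described above; the lane logic stays the same.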

 


If you are interested in what I say, see NyoTengu's source code; in there the fastest search known to me is done, achieving 30+GB/s search rates.
Always glad to talk to coders who appreciate the details... ask me what you see as NG (no good), and I will try to address it in the next months.

Add-on, Feb-25:

It is always a good idea to have one's sources in hardcopy, i.e. on paper or in .PDF; here they are:

MASAKARI_r8_Vanilla.BAS:

 


TetraNyoTengu.c:

Upload Denied, size limitation...

 
   

In revision 8, new things:

- Ability to browse non-text files; in contrast to old revisions, the characters 00..31 are converted to 32;
- Added mouse counterparts to single PgUp and PgDn: the former is mapped onto Double-Left-Click, the latter onto Double-Right-Click.
« Last Edit: February 25, 2021, 07:36:12 PM by Sanmayce »