Talk:Help/Index A

It's tempting to imagine that manually curated indices have been made obsolete by search engines.

However, spammers (and misleading search engine optimization efforts) have been wreaking considerable havoc on search engines. Also, search engines have difficulty distinguishing between "domain relevant" and "irrelevant" language use. And there are other issues.

Long story short: manually curated indices are still useful and are likely to remain useful.

That said, there's still considerable benefit from using specialized software to build and update these indices.

Stay tuned...


When working on the index pages, having a relatively recent local copy of the wiki content can be invaluable. Wiki search is great when working manually, but if you are building tools, having a cached copy of the content can speed up your work immensely.

But the wiki has a lot of "exposed infrastructure", so automated tools designed for other contexts might struggle to extract what we are interested in here (the content of the Main namespace). MediaWiki itself provides ways of bulk downloading these pages, and that's probably a good way to go.
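For reference, one such built-in mechanism is MediaWiki's Special:Export page, which returns the wikitext of a named page wrapped in an XML document. A minimal sketch, assuming exporting is enabled on this wiki (the page name Main_Page is only an illustration):

```j
require'web/gethttp'
NB. Special:Export/<name> returns an XML wrapper around the
NB. wikitext of the named page ('Main_Page' is only an example)
xml=. '-sL' gethttp 'https://code.jsoftware.com/wiki/Special:Export/Main_Page'
```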

But, here's another approach. (This might take an hour or so to pull down a cached copy of the current wiki):

NB. j wiki spider
require'web/gethttp'
SRC=: 'https://code.jsoftware.com/wiki/'
DEST=: '~user/wikidir/'

mkdir=: [: 1!:5 ::0:@<@jpath@>@|. [: dirname&.>^:a: <  NB. create directory y, including any missing parents
dirname=: {.~ i:&'/'  NB. y up to (not including) its last '/'
NB. x dirpath y: cache file name for page y (x=0: rendered html, x=1: wiki source)
dirpath=: {{
  ext=.x{::'html';'wsrc'
  safe=. y rplc'.';'.dot';'*';'.star';'//';'/slash.'
  DEST,ext,'/',safe,'.',ext
}}
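Under these definitions, dirpath maps a page name to a cache file name, escaping characters that are awkward in file names (the page names below are only illustrations):

```j
   0 dirpath 'Vocabulary'
~user/wikidir/html/Vocabulary.html
   1 dirpath 'Guides/Window.Driver'   NB. '.' in the name becomes '.dot'
~user/wikidir/wsrc/Guides/Window.dotDriver.wsrc
```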
snap=: I.{[  NB. x snap y: for each y, the nearest following position in x
pans=: (<:@I. { [)~  NB. x pans y: for each x, the nearest preceding position in y
snapr=: pans&.:-&.:|.  NB. x snapr y: for each x, the nearest following position in y
locs=: I.@E.  NB. x locs y: starting indices of occurrences of x in y
loc=: {.@locs  NB. x loc y: index of the first occurrence of x in y

NB. url of the edit form (whose textarea carries the wiki markup) for page url y
editurl=: {{
  assert. SRC-: (#SRC){.y
  assert. -.'?#' e. y
  'https://code.jsoftware.com/mediawiki/index.php?title=',(#SRC)}.y,'&action=edit'
}}
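For example, editurl rewrites a rendered-page url into the url of the corresponding edit form:

```j
   editurl 'https://code.jsoftware.com/wiki/Vocabulary'
https://code.jsoftware.com/mediawiki/index.php?title=Vocabulary&action=edit
```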

NB. this version assumes no nesting and no conflicting comments
NB. good enough for now -- conceptually fixable 
NB. (for example, extracting only top level 
NB.  and explicitly ignoring '<' and '>' characters in comments)
NB. this approach avoids unneeded complexity (for now)
innerhtmls=: {{
  opens=. 1+('>' locs y) snap ('<',x) locs tolower y
  closes=. ('</',x) locs tolower y
  y {L:0~ opens (+ i.)each closes-opens
}} L:0
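For instance, on a small made-up fragment, innerhtmls yields the text between matching open and close tags (boxed, one box per occurrence; raze with ; to get the text):

```j
   ; 'b' innerhtmls '<p>x <b>bold</b> y</p>'
bold
```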

outerhtmls=: {{
  opens=. ('<',x) locs tolower y
  closes=. ('>' locs y) snap ('</',x) locs tolower y
  y {L:0~ opens (+ i.)each closes-opens
}} L:0

NB. rendered html and wiki source of page url y, fetched and cached on first use
getcache=: {{
  assert. SRC-: (#SRC){.y
  h=. 0 dirpath (#SRC)}.y
  w=. 1 dirpath (#SRC)}.y
  if. 2>#H=. fread h do.
    mkdir dirname h
    H=. '-sL' gethttp y
    H fwrite h
  end.
  if. 2>#W=. fread w do.
    mkdir dirname w
    W=. rplc&('&lt;';'<';'&amp;';'&') ;'textarea' innerhtmls '-sL' gethttp editurl y
    W fwrite w
  end.
  H;W
}}
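getcache can also be used on its own: the first call fetches over the network and writes both cache files; later calls simply reread them. (Vocabulary is just an example page here.)

```j
   'html wikitext'=. getcache SRC,'Vocabulary'
```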

NB. href targets of the <a> tags in html y
rawlinks=: {{
  htm=. tolower y
  a=. '<a ' locs htm
  href=. 'href='
  ah=. a snapr href locs htm
  ae=. a snapr '>' locs htm
  ur=. (#href)+(ah < ae)#ah
  us=. (I.htm e.' >') snap ur
  y trimquote@:{L:0~ur (+ i.)&.> us-ur
}}

dests=: {{ 6}.&.>~.(#~ ('/wiki/'-:6&{.)@>)({.~i.&'#')&.>rawlinks y }}  NB. unique /wiki/ link targets, '#' fragments stripped

maindests=: {{ (#~ 0=':'&e.@>) dests y }}  NB. keep Main namespace targets (no ':' in the name)
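On a small made-up fragment, dests keeps only the /wiki/ targets, and maindests further discards pages from other namespaces:

```j
   html=. '<a href="/wiki/Vocabulary">V</a> <a href="/wiki/Special:AllPages">S</a>'
   dests html
┌──────────┬────────────────┐
│Vocabulary│Special:AllPages│
└──────────┴────────────────┘
   maindests html
┌──────────┐
│Vocabulary│
└──────────┘
```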

trimquote=: {{
  if. '""' -: ({.,{:) y do. }.}: y return. end.
  if. '''''' -: ({.,{:) y do. }.}: y return. end.
  y
}}
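trimquote strips one layer of matching double or single quotes and leaves anything else alone:

```j
   trimquote '"/wiki/Vocabulary"'
/wiki/Vocabulary
   trimquote 'unquoted'
unquoted
```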

spider=: {{
  old=. next=. ''
  todo=. <'Help/Pcre/PCRE_Index'
  whilst.-.old-:todo do.  NB. page through Special:AllPages until no new titles turn up
    echo (":#old=.todo),' ',next
    todo=. ~.todo, maindests '-Ls' gethttp 'https://code.jsoftware.com/mediawiki/index.php?title=Special:AllPages&from=',next
    next=.;{:todo
  end.
  j=. 0
  while.j<#todo do.  NB. cache each page, adding any newly discovered Main namespace links
    echo url=. SRC,j{::todo
    todo=. ~.todo,maindests 0{::getcache url
    j=. j+1
  end.
  (;todo,&.>LF) fwrite DEST,'index.txt'
  #todo
}}

Here, spider'' will grab both the rendered html and the wiki markup representations of most of the current wiki pages.