Session #112 - Nov 10, 2017 - FRIDAY - Linux at Lunch - KEY WORD SCRIPT walk through!
(skipped a session on 3 Nov 2017)
The Key Word Script uses regular expressions and standard tools
(grep, sed, awk, cut, uniq, sort, perl, tr and vi) to extract KEY WORDS from a text document!
Key Words are, hopefully, proper names and places: words beginning with capital letters.
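To make the idea concrete, here is a minimal sketch of the extraction on one made-up sentence. It is not the script itself: the function name extract_keywords and the sample text are only for illustration, and it keeps capitalized words with a single grep rather than removing lowercase words and numbers.

```shell
#!/bin/bash
# Sketch: list the distinct capitalized words in a small sample text.
# "extract_keywords" and the sample sentence are hypothetical, for illustration only.
extract_keywords() {
    tr -d '[]{}();":,!.?' |      # strip most punctuation
    tr -s ' ' '\n' |             # put each word on its own line
    LC_ALL=C grep '^[A-Z]' |     # keep only words starting with a capital letter
    LC_ALL=C sort -u             # sort and drop duplicates
}

sample='And Abram went up out of Egypt, he and Lot with him. Abram was very rich.'
keywords=$(printf '%s\n' "$sample" | extract_keywords)
printf '%s\n' "$keywords"
```

Note that "And" gets through because it starts a sentence, which is why the word list still needs a manual edit.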
We'll walk through this script, line by line,
which has already spun off a few modified versions to create different content.
This script produces two Key-Word output files: one that lists only how many times each word is found,
and another that also shows the context, using grep -C 1 to list the matching line plus the line before and the line after.
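The difference between the two outputs can be seen in a small sketch. The five-line sample file and the variable names are made up for the demo; only grep -c and grep -C 1 are the point.

```shell
#!/bin/bash
# Sketch: count-only output versus context output for one key word.
# The sample file below is invented for this example.
workdir=$(mktemp -d)
cat > "$workdir/sample.txt" <<'EOF'
In the beginning
Abram dwelt in Canaan
Lot dwelt in Sodom
the land was not able to bear them
they separated themselves
EOF

word=Sodom
count=$(grep -c -- "$word" "$workdir/sample.txt")       # number of lines containing the word
context=$(grep -C 1 -- "$word" "$workdir/sample.txt")   # each match plus one line before and after

echo "The word \"$word\" is found $count times."
printf '%s\n' "$context"
rm -r "$workdir"
```

Keep in mind that grep -c counts matching lines, so a word appearing twice on one line is still counted once.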
It's possible to insert HTML coding into this script to create a web page as I have demonstrated previously,
with: http://johnmeister.com/linux/Scripts/mk-webpage-2015-05-08.html
I know that there are GUI based tools like WordPress that might create nicer pages...
but these tools have security issues, require a database, and are not trivial to install, learn, use or manage.
SIMPLE ALWAYS WORKS.
It's just interesting to see the names and places listed in just ONE of the 66 books:
http://johnmeister.com/bible/BibleSummary/KEY-WORDs/01-GENESIS/ZLIST-names-01k_Gen.txt
http://johnmeister.com/bible/BibleSummary/KEY-WORDs/GET-KEY.sh.txt
(I will create a copy of this on johnmeister.com/linux that works with generic text files.)
------------------------------------------------------------
#!/bin/bash
### john meister 29 october 2017 - copyright 2017
### script to find Key words - upper case - Who and Where
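USAGE="GET-KEY.sh.txt FILE , e.g. GET-KEY.sh.txt 02n_Exo.txt"
[ -r "$1" ] || { echo "usage: $USAGE" >&2; exit 1; }   ### stop if no readable input file was given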
#########################################################################
tr -d '[]{}();":,!.?' < "$1" > T2
### remove most special characters from the input file, save as T2
perl -pi -e 's/ /\n/g' T2
### put each word on its own line
grep -v "^'" T2 > T3
### drop words that begin with a tick, save as T3
sort -u T3 | grep -v '^[a-z]' | grep -v '^[0-9]' > T4
### sort, remove duplicates, lc words and numbers, save as T4
rm -f T1 T2 T3
### deleting temporary files
##########################################################################
echo "edit T4 in vi to remove non-keywords; :wq will resume; press Enter to start"
read
vi T4 # T4 will be moved to ZLIST-names when done
#########################
OUT="KEY-names-with-VS-$1" # Key names with counts and verses
OUT2="TIMES-names-$1" # Key names with counts only
cat /dev/null > "$OUT" ; cat /dev/null > "$OUT2"   # start both output files empty
for Z in $(cat T4)
do
echo "----------------------------------------------------------------------" | tee -a $OUT
COUNT=$(grep -c -- "$Z" "$1")   # number of lines containing $Z (wc -l padding differs between systems)
####################
echo "The word \"$Z\" in \"$1\" is found $COUNT times." | tee -a "$OUT2"
echo "======================================================================" | tee -a "$OUT2"
####################
echo "The word \"$Z\" in \"$1\" occurs in these verses $COUNT times:" | tee -a "$OUT"
echo "----------------------------------------------------------------------" | tee -a $OUT
grep -C 1 -- "$Z" "$1" | tee -a "$OUT"
echo "======================================================================" | tee -a $OUT
done
###########################################################
mv T4 ZLIST-names-$1 ; rm -f T1
##########################################################
So, the extract files look like this for the following command(s):
GET-KEY.sh.txt 01k_Gen.txt
and
GET-KEY.sh.txt 01n_Gen.txt
KEY-names-with-VS-01k_Gen.txt 2017-10-30 00:40 1.2M
KEY-names-with-VS-01n_Gen.txt 2017-10-30 00:40 1.2M
TIMES-names-01k_Gen.txt 2017-10-30 00:40 69K
TIMES-names-01n_Gen.txt 2017-10-30 00:40 69K
ZLIST-names-01k_Gen.txt 2017-10-30 00:40 3.8K
ZLIST-names-01n_Gen.txt 2017-10-30 00:40 3.9K
In addition, I want to modify the output to only show the references:
http://johnmeister.com/bible/BibleSummary/KEY-WORDs/01-GENESIS/KEY-names-with-VS-01k_Gen.txt
e.g.
----------------------------------------------------------------------
The word "Achbor" in "01k_Gen.txt" occurs 2 times: Gen 36:38; Gen 36:39
======================================================================
or:
----------------------------------------------------------------------
The word "Achbor" in "01k_Gen.txt" occurs 2 times:
Gen 36:38; Gen 36:39
======================================================================
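One way this references-only output might be produced is sketched here. It assumes every line of the text file starts with "Book chapter:verse" (e.g. "Gen 36:38 ..."), which may not match the real files; the function name refs_for_word and the sample lines are made up for the demo.

```shell
#!/bin/bash
# Sketch: print only the verse references for one key word.
# Assumption: each line begins with "Book chapter:verse", so fields 1-2 are the reference.
refs_for_word() {   # usage: refs_for_word WORD FILE  (hypothetical helper)
    local word=$1 file=$2 count refs
    count=$(grep -c -- "$word" "$file")
    refs=$(grep -- "$word" "$file" | cut -d ' ' -f 1-2 | paste -s -d ';' - | sed 's/;/; /g')
    echo "The word \"$word\" in \"$(basename "$file")\" occurs $count times: $refs"
}

# Sample lines typed in for the demo; the real text files may be laid out differently.
workdir=$(mktemp -d)
cat > "$workdir/sample.txt" <<'EOF'
Gen 36:37 And Samlah died and Saul of Rehoboth reigned in his stead
Gen 36:38 And Saul died and Baalhanan the son of Achbor reigned in his stead
Gen 36:39 And Baalhanan the son of Achbor died and Hadar reigned in his stead
Gen 36:40 And these are the names of the dukes that came of Esau
EOF

line=$(refs_for_word Achbor "$workdir/sample.txt")
echo "$line"
rm -r "$workdir"
```

Inside the main loop, the same cut and paste steps could replace the grep -C 1 line, but only if the reference really is the first two fields of every line.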
SIMPLE is portable, secure and easily modified, but it may take more effort and understanding,
may include more "steps", and sometimes requires manual intervention. NOTICE the USE of "read" in the
script above: I had to manually edit the "key word" list to remove unimportant words like
"The", "And", "Because", etc., and a few other items that didn't filter out properly.
I could possibly create a "heredoc" to filter those words, but once the text has been split
into single words on separate lines, a plain grep -v filter would also take out any other word that starts
with the same letters (e.g. "And" would remove "Andover", etc.).
That would push the filtering into the text prior to the newline insertion,
and since one has no idea in advance which words are "important" or of interest, it is better to remove them manually.
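For what it's worth, the prefix problem can probably be avoided by making grep match whole lines only. The sketch below is not part of the script; the stop list and the word list are made-up examples. It uses grep -x (whole-line match), -F (fixed strings) and -f (patterns from a file) with a stop list written from a heredoc.

```shell
#!/bin/bash
# Sketch: remove stop words from a one-word-per-line list without touching longer words.
# The stop list and the word list are invented for this example.
workdir=$(mktemp -d)
cat > "$workdir/stopwords" <<'EOF'
The
And
Because
EOF
printf '%s\n' Abram And Andover Because The Theophilus > "$workdir/words"

# -v inverts the match, -x matches whole lines only, -F treats patterns as fixed
# strings, -f reads one pattern per line from the stop list file.
kept=$(grep -v -x -F -f "$workdir/stopwords" "$workdir/words")
printf '%s\n' "$kept"
rm -r "$workdir"
```

Here "Andover" and "Theophilus" survive while "And" and "The" are dropped. Someone still has to decide which words belong in the stop list, so the manual review in vi does not go away.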