Log files contain a wealth of data, but if you don't know how to read them, they look like chicken scratchings. This tutorial will show you what's in your log files, and how to see where your visitors came from, even re-running their search queries.
LOG FILES

What's in a log file? Well, your standard who, what, when and where. And if you work your way through this little tutorial, you'll be able to re-run your users' web searches to see who your competition really is. Really!
"Who" --- the first item is usually an IP number or fully qualified domain name.
"What" -- What they saw, which web page, and what they are using to view it, giving you some demographics information
"When" -- When they came. And with that, more information about their habits.
"Where" -- (Ok, this is a stretch.) Where in the sequence of pages they saw your page, or where they came from. (Not all servers provide this information.)
Here's a sample record from one of my log files. (It's really all on one line, but we had to fold it for publication.)
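For reference, a record in the common combined log format looks like this (the host name, page name, referer, and browser string here are illustrative stand-ins, not the original record):

```
cache-mtc-ah04.proxy.aol.com - - [24/Aug/2001:23:13:23 -0700] "GET /freemag.html HTTP/1.0" 200 63855 "http://searchengine.example/search?q=free+magazine" "Mozilla/4.0 (compatible; MSIE 5.5; Windows 95)"
```

The pieces of this record are picked apart, one by one, below.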
This is a proxy from AOL. It represents one viewer, but others likely see the same page you serve, as AOL caches it and re-serves it. How many more will depend upon how popular your pages are. The host name is also how you identify many (not all) search engine crawlers.
WHEN: [24/Aug/2001:23:13:23 -0700]
Time of day. When do the bulk of your visitors come? Is it the home browsing crowd, or the guys at work? Or might they be from another part of the world? (Check the who for something other than .com.)
He saw the free magazine page: the 200 status code means the page was served (404 would mean it was not there; 304 would mean it wasn't modified, and they got it from their browser's cache, having viewed it before). The next number tells us the server sent 63855 bytes. You can count the number of times the page was hit, and see the number of pages that they asked for which you DIDN'T have -- yet! Those are the 404's. (I hope /default.ida is one of your 404's, not a 200! That's the Code Red worm trying to crawl into your system. I've gotten 156 of those probes so far.)
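Jumping ahead a little to the tools introduced below, counting your 404's is a one-liner. Here's a sketch against a made-up three-line log (the host names and /freemag.html are invented; the status code sits in field 9 of this layout):

```shell
# A tiny made-up log to run against.
cat > sample.log <<'EOF'
host1.example.com - - [24/Aug/2001:23:13:23 -0700] "GET /default.ida HTTP/1.0" 404 300
host2.example.com - - [24/Aug/2001:23:14:02 -0700] "GET /freemag.html HTTP/1.0" 200 63855
host3.example.com - - [24/Aug/2001:23:15:11 -0700] "GET /default.ida HTTP/1.0" 404 300
EOF
# Keep only the 404 records, cut out the page name, then group and count.
grep ' 404 ' sample.log | cut -d " " -f 7 | sort | uniq -c | sort -nr
```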
This is the page he was looking at when he clicked on my link. (Not all ISPs provide it; you should demand it in your logs.) And since it was a search engine, I can see he was searching for "free magazine" and "playboy". (Must not have read the title of my page, or he was doing market research.) This information is called psychographics -- the psychology of the visitor. Note that this HTTP_REFERER information is not provided by every browser, and is not always accurate.
You can get this data from the HTTP_REFERER variable with scripts, then have the script feed your visitor exactly what he or she is looking for. You can also re-run his searches, or the more common searches, and see how you rate, what your competition is, etc. By the end of this tutorial, you'll know how to do that.
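Here is a minimal sketch of that idea as a shell function (the q= parameter name, the page text, and the sample referring URL are all my assumptions -- real search engines name their query parameters differently):

```shell
# Sketch: pull the search terms out of HTTP_REFERER and feed them back
# to the visitor. Normally the web server sets HTTP_REFERER for a CGI
# script; here we fake one so the example runs on its own.
greet() {
    # Grab the q= query parameter, if any, and turn '+' back into spaces.
    terms=$(echo "$HTTP_REFERER" | sed -n 's/.*[?&]q=\([^&]*\).*/\1/p' | tr '+' ' ')
    if [ -n "$terms" ]; then
        echo "<P>Looking for $terms? You came to the right place.</P>"
    else
        echo "<P>Welcome!</P>"
    fi
}

HTTP_REFERER="http://searchengine.example/search?q=free+magazine"
greet
```

A real CGI version would also print a Content-type header first, and would read HTTP_REFERER as set by the server instead of faking it.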
This is what he was browsing with, and what operating system he was running -- more hints at his or her demographics. Mac user? Probably more graphics oriented. Linux user? Probably more technical. Netscape? Could be a power user, or just an MS hater. Win95? A legacy system; the user is either not rich, or doesn't care much about computers.
Ok, so there are too many entries in your log file to do this on a one-by-one basis. So maybe you just want page counts. Well, there are plenty of tools for that. But... if you learn these simple things, you can do what the common tools can't do.
GETTING YOUR HANDS ON IT
If you are able to log in to a UNIX shell via Telnet, a few simple tools can let you do a LOT of analysis, including re-running your visitors' queries.
Computers are interactive, like swimming: you have to thrash the water to get anywhere. I don't get the feel that things work unless I get on the computer and actually type the commands as I read about them. That's the way I felt when I first cracked the manuals back in college, teaching myself to program in two weeks before the classes got started; and that's the way I still feel, decades later. So log on to your server, and try these examples as we go. No cutting and pasting! Print this file and type the commands yourself; you have to teach your fingers the words.
The utility "grep" and "egrep" are like grappling hooks, letting you grasp select records and extract them. For example, if you wanted to see all the references to /portfolio.html was referenced in your logs, you would say:
grep /portfolio.html your-log-file
(Come on, try this, the water is fine! All right, so maybe it feels freezing cold. You'll get used to it! Come on, try it! Call your tech support line if you don't know where your log file is, or if you don't know how to log on.)
To see what was fetched from aol.com, you would type
grep aol.com your-log-file
To see what aol users got your portfolio page, you would combine the two requests this way, feeding the output of one command into another using the vertical bar as a pipe.
grep /portfolio.html your-log-file | grep aol.com
Ok, let's say you are running a promo banner at Octopus Search, that new engine you heard about last month. How many visitors come from there? To find out, you created a special landing page for the banner to link to, octofree.html. So we grep for /octofree.html, then count the lines with "wc", which stands for word count, using the -l option to limit the output to just the number of lines:
grep /octofree.html your-log-file | wc -l
Or to see who is using a Macintosh ("Mac" is the common substring to match, catching Macintosh, Mac_PowerPC, and others; you may also get some false matches, but not that many):
grep Mac your-log-file | wc -l
THE BIG EXAMPLE
What if you wanted a count on all of your pages, sorted by popularity? That's a little more complicated, but still in the range of a one-line command.
To count, we need to count like things, and there are a lot of things in your log file. We will cut the log file to include only the file names, then sort it so we can count similar records, counting up the number for each page. Once that is done, we will re-sort that list to give it to us with the most visited pages on top.
Let's build the command step by step, watching that we get the right output each time. That's the way I started years ago, and that method still serves me fine. Oh, up arrow lets you edit previous command lines on many UNIX / LINUX systems.
To get it started, we use "cat", the concatenate command. It's like DOS's "type" command. ("Unix is a dog, so use cat.") Ask your systems person where your log file is, and use that instead of my "your-log-file" label:
cat your-log-file
Runs off the screen! Well, we can stop that by using "more", to get less. (Some systems also have a "less" command which does more.) We pipe the output of "cat" into the "more" command using the vertical bar, what we sometimes call the "or" symbol. (I always call it the or symbol, confusing people.)
cat your-log-file | more
Now, we add the cut command to extract only the file name. We need to specify a field delimiter, which will be a blank. That's -d " ". Then we have to tell it which field we want to look at, which in _my_ case is fields 7 through 7. So the command becomes:
cat your-log-file | cut -d " " -f 7-7 | more
You may need to adjust the 7-7 to match your particular system's log file layout; that's why we are building the command up step-by-step. That, and so you will understand how to modify all this for other purposes. When this runs right, you should just see page names, like /portfolio.html, /index.html, etc. Play with the numbers till you see only the file names.
Next, we want to count each file name. But to count, we need to group all the same file names together, so we have to sort before we count:
cat your-log-file | cut -d " " -f 7-7 | sort | more
Then "uniq -c" collapses each run of identical lines into a single line, prefixed with a count of how many times it appeared:
cat your-log-file | cut -d " " -f 7-7 | sort | uniq -c | more
Because we sorted the file, those counts come out in alphabetical order by file name. We can re-sort them into popularity order with another sort, using the -n option to sort numerically (so that 21 comes before 101) and the -r option to reverse the order, putting the most popular pages on top:
cat your-log-file | cut -d " " -f 7-7 | sort | uniq -c | sort -nr | more
And there we have it! Sure, you CAN use WebTrends, but what if you just want the page counts that come from AOL, or just from a specific client of yours? Or how many hits you got from Google? Remember the grep for aol.com?
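To see that in action, here is the AOL-only version of the counting pipeline, run against a made-up four-record log (the host names and pages are invented):

```shell
# Four fake log records: three AOL proxy hits and one from elsewhere.
cat > sample.log <<'EOF'
cache-am05.proxy.aol.com - - [24/Aug/2001:23:13:23 -0700] "GET /index.html HTTP/1.0" 200 1024
cache-am05.proxy.aol.com - - [24/Aug/2001:23:14:02 -0700] "GET /index.html HTTP/1.0" 200 1024
cache-am06.proxy.aol.com - - [24/Aug/2001:23:15:30 -0700] "GET /portfolio.html HTTP/1.0" 200 2048
dialup.example.net - - [24/Aug/2001:23:16:11 -0700] "GET /index.html HTTP/1.0" 200 1024
EOF
# Keep the AOL records only, then count pages, most popular first.
grep aol.com sample.log | cut -d " " -f 7-7 | sort | uniq -c | sort -nr
```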
Step by step, we do the following:
1. cat the log file, cut it at field 11 (which you may need to change, so start building up the command as we did above, inspecting each step for reasonable output),
2. sort it so we can count it,
3. use uniq -c to count how many times each reference is used,
4. use sed, the Stream Editor, to substitute (s/old/new/) the first character of each line -- "." matches any single character, and "&" stands for whatever matched -- with itself plus "<BR>", so each record gets its own line in the browser; then use another substitution to wrap a link around the referencing URL,
5. use ">" to redirect the output into a file, so you can look at it with a web browser and click on the links to see what your visitors saw.
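Assembled, the five steps look like this -- a sketch run against a one-record fake log (referers.html is just a name I picked, and the second sed expression assumes the referer is quoted in your log; substitute your real log file for sample.log):

```shell
# A one-record fake log with a quoted referer in field 11.
cat > sample.log <<'EOF'
host1.example.com - - [24/Aug/2001:23:13:23 -0700] "GET /freemag.html HTTP/1.0" 200 63855 "http://searchengine.example/search?q=free+magazine" "Mozilla/4.0"
EOF
# Steps 1-5: cut the referer field, sort, count, break lines with <BR>,
# wrap each URL in a link, and redirect into a file for your browser.
cat sample.log | cut -d " " -f 11 | sort | uniq -c \
  | sed 's/./&<BR>/' \
  | sed 's|"\(http[^"]*\)"|<A HREF="\1">\1</A>|' \
  > referers.html
```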
Simple, once you know how. And if you play with it a little, you WILL know how.
Two other UNIX / LINUX commands you really need to know are "man" and "apropos". Man gives you the manual pages for the specified command:
man grep
Apropos tells you which commands might be appropriate for a given task:
apropos edit
Some UNIX systems also use info instead of man.
Depending on the system you use, you might want to pipe apropos to more so you get less on the screen at one time.
apropos edit | more
This kind of play with your log data can be a lot of fun! Unix isn't just an operating system, it's a language you can use to describe and extract all kinds of fascinating information from your web logs and other files.
In some of my consulting, I use these types of commands to get a bit of information quickly, then if there is enough there to make it worth while, I write PERL scripts to generate more elaborate reports for marketing, etc.
If you have a quick question or two, e-mail me or call me at 408-779-9842. (I am in the process of moving, in part due to connectivity issues, so I will be in and out, and the number will change soon.) If you want something more elaborate, then let's talk about your needs and your budget.
javilk(at)mall-net.com ------------------- webmaster(at)Mall-Net.com
------------------------- IMAGINEERING -------------------------
------------------ Every mouse click, a Vote ------------------
----------- Do they vote For, or Against your pages? -----------
------- What people want: http://www.SitePsych.com/free/ -------
-- Check your page: http://www.SitePsych.com/sanitycheck.html --
- We have the reports, products and services to help you Grow! -
--- Web Imagineering -- Architecture to Programming CGI-BIN ----