* NetResearch: Hidden CS Topics

  • Space efficiency vs. time efficiency: This is perhaps the most fundamental tradeoff in all of computer science, and it appears (in disguise) in several aspects of Internet searching.

    Chapter 5, Web Searching Techniques, explains that search engines do not search the whole Internet every time they receive a query, for such an operation would require weeks or months. Instead, search engines engines build catalogs (indexes) of URLs for convenient searching. This additional use of space by the search engine results in several orders of magnitude of speedup.

    Chapter 10, Finding Information Again, introduces the concept of bookmarks for revisiting Web pages quickly. Bookmarks are distinguished from copying a Web page to one's hard drive. A copy may be accessed more rapidly than the original, saving time, but it takes up space on one's hard drive.

    Chapter 10 also briefly mentions caching, yet another space/time tradeoff. An instructor could expand upon the material in the book. The most popular Web browsers openly discuss caching in their Help and Options windows, so users need practical knowledge of what caching is, why it requires additional memory or hard drive space, and why it's helpful.

  • Black box vs. white box: Chapter 3, Views of the Internet, explains that the Internet may be viewed in several useful ways, such as a collection of networked computers, or as a vast library that accepts queries and responds with answers. These two views correspond to the complementary concepts "white box" and "black box" from software engineering, respectively.

    A white box view reveals the inner workings of a mechanism; in this case the Internet internally is a bunch of computers connected by wires, satellite links, etc. A black box view keeps the inner workings hidden; in this case, a Web search engine hides the computer network behind a simple query-and-response interface.

    These sections of NetResearch might be used to introduce the benefits and tradeoffs of data abstraction. For example, by hiding the network, we permit the network to change without altering the (search engine) interface. By revealing the network, we allow direct connections to be made between machines when desired. Each of these views is beneficial to effective Internet searching under different circumstances.

  • Procedural vs. data-directed programming models: Chapter 3 also introduces two other views of the Internet: as a collection of programs (email reader, web browser, Usenet newsreader, etc.) or a collection of resources (Yahoo, movie review archives, the New York Times web site, etc.). These views correspond to two complementary models of programming, "procedural" and "data-directed," respectively.

    The procedural model specifies how a computation should be accomplished. In this case, the computation is an Internet search, and the question of how is answered in various ways: with a Web browser, with a Usenet newsreader, with an email program, etc. By learning how to use certain tools very well, one becomes a better searcher.

    The data-directed model specifies what is the desired result, without specifying how it should be obtained. In the domain of Internet searching, this is like stating the kind of resource you want to locate: an online phone book, a legal resource, the San Diego Zoo, a collection of freely distributable software, the Library of Congress, etc. By learning how to seek particular resources without relying on particular tools, one becomes a better searcher.

  • Databases and queries: Online search engines, presented in the most detail in chapter 5, Web Searching Strategies, are effectively databases with colorful front-ends. An in-class treatment of search engines could be extended into a discussion of more general database concepts, such as consistency, distribution, and relations.

  • Active vs. passive entities: Chapter 5, Web Searching Strategies, explains the difference between active search engines, which seek out web pages, and passive search engines, which require web pages to be registered. The active/passive distinction is pervasive throughout computer science, however, and this could be emphasized in class. For example, in client/server computing, servers may initiate connections with clients (active) or wait for connections from clients (passive). In error handling, a program can produce error message in response to bad input (passive), or prevent bad input from being entered in the first place (active). And so on. Each model has tradeoffs relating to efficiency, consistency management, and other factors.

  • Shallow copy vs. deep copy: Another fundamental computer science concept is the distinction between copying and referencing an object. If multiple copies of an object may exist, such as remote and cached copies of a web page, then inconsistencies can develop between the copies (hence, the existence of the Reload button of web browsers). If only one copy exists but may be referenced from multiple locations, as in bookmarking, efficiency may suffer. This general issue is pervasive throughout computer programming, networking, fault-tolerance, databases, and so on, and may be called to the attention of students.

  • Composition: In computer science, solutions to problems are often composed to produce solutions to larger problems. On the web, composition is found in the guise of meta-search engines (Chapter 5) that forward queries to other search engines. Meta-search is fraught with tradoffs relating to efficiency, consistency, distribution, etc., and NetResearch brings these tradeoffs into a context non-majors can understand.

  • Consistency maintenance: The Net is a constantly-changing environment, which is why no search engine can catalog the entire Internet, and why there is no global directory of Internet users. Any such catalogs or directories would rapidly go out of date, presenting a serious consistency maintenance problem. NetResearch explains the difficulties behind this issue.

    Consistency maintenance is not limited to the Internet, of course. It appears throughout computer science: distributed databases, software engineering, etc. Once students understand the consistency issues of the Internet, these more advanced topics may be briefly introduced.

  • Heuristics: Internet searching is an inexact science. Search engine queries are imprecise attempts to convince a search engine to cough up a relevant page, and search engines themselves can only approximate the relevance of a page to a given query. More than any other computer-related application, Internet searching reveals to the average user that computation is not always an exact procedure directed by zeroes and ones. A heuristic is a technique that doesn't always work, but is likely to work most of the time. Both people and search engines use heuristics all the time on the Net.

    The search strategies covered in Chapter 5, Web Searching Techniques, are shown to be imprecise but still highly useful. The instructor might use this material to introduce the concept of heuristics and spur discussion. Relevant questions for the students might be: Why can't we make searches 100% precise? Why is it hard for search engines to determine the relevance of a web page to a given query? If computers got 1000 times faster, would searches become 1000 times more precise? What are some advantages and disadvantages of heuristics? Can you think of situations where you use heuristics in your daily life? Can you think of situations where heuristics would absolutely not be acceptable?

    The ranking functions used by search engines to sort their responses by relevance, described as part of the "search-and-rank" strategy, are also heuristics.

  • Booleans and logical operations: Many search engines allow queries to contain logical operators such as AND, OR, NOT, and grouping. Proper use of these operators is essential for making queries more precise, and improper use can turn a reasonable-looking query into a magnet for irrelevant pages. For example, the query

    Captain Kirk AND Mr Spock

    might appear to search for two characters from Star Trek, but in fact, a search engine could interpret it as

    Captain OR (Kirk AND Mr) OR Spock

    which has a different meaning and will match many irrelevant pages.

    NetResearch provides an introduction to these operators, and the instructor may investigate them in more detail, perhaps introducing truth tables and precedence rules.

  • Basic parsing of expressions: As search engine technology matures, the engines become capable of handling complex queries such as

    "Captain Kirk" AND NOT (Spock OR (Vulcan and "pointed ears") AND (episode (ten OR 10))

    This provides a familiar, practical example of parsing. The instructor could use this to lead basic discussions on parsing, compilers, the relevance of human language theory and grammar to computer languages, natural language recognition, and related topics.

  • Natural Language Recognition: Why do we have to type complicated Boolean expressions to initiate searches? Why can't we just use plain human language? Although NetResearch does not address these questions directly, they represent a natural direction for in-class discussion.

  • False positives vs. false negatives: (Also called "Type 1 Error vs. Type 2 Error.") When formulating a search query, is it better to be too specific, attempting to minimize false negatives, or too general, attempting to minimize false positives? This tradeoff is found throughout the experimental sciences, and Internet searching brings it to the general population. Chapter 5 discusses it implicitly when presenting General Search and Specific Search.
  • Back to Instructor's Guide

    <p> <hr> This page is maintained by <A HREF="mailto:dbarrett@ora.com"> Daniel J. Barrett, dbarrett@ora.com</A>