Intro to Computer Science

Spring 2021

chapter 5 : strings, lists, and files

This next material looks at some data structures, containers of sequential data of various types.

Here's a summary of the topics to read about and understand :

Just as integers and floats are data types, all of these are also data types. You can think of each as a box that other things can be put into.

You access the stuff inside strings and lists with square brackets after them. This is the same idea as subscripts in math. If for example we have some numbers x = [100, 101, 102] then the math and computing conventions to represent each would be :

math: \( x_0, x_1, x_2 \)

code: x[0], x[1], x[2]
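
Here is a small sketch of that bracket notation on both a list and a string. The variable names (word, first_number and so on) are just made up for illustration.

```python
x = [100, 101, 102]      # a list of three numbers
word = "cat"             # a string of three characters

first_number = x[0]      # counting starts at 0, not 1
last_number = x[2]       # so the last of three items is at index 2
first_letter = word[0]   # strings use the same bracket syntax

print(first_number, last_number, first_letter)   # 100 102 c
print(len(x), len(word))                         # 3 3
```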

For the most part, the slides for this chapter do a pretty good job of summarizing the syntax, so I won't repeat all that here. I will work through some examples of the syntax in a video, if you'd rather hear an explanation than read the book or the slides.

But here in the notes I'll just go over a few specifics.

ASCII

Inside the computer, all data is 1's and 0's. We want some of those bits to represent characters, so that we can store text. One common way to turn the 1's and 0's into letters is called ASCII ... an acronym that I can never remember.

You should read about it at for example wikipedia: ASCII

Each ASCII character is a number from 0 to 127, with a corresponding letter or teletype code. Python has two functions, ord() and chr(), to turn one into the other.

>>> ord('a')
97
>>> chr(97)
'a'

Often times these numbers are written in hex (we've talked about hex a bit already), since they are 7 bits of data and so fit within 1 byte, or two hex digits.
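
To make the hex and 7 bit business concrete, here is a short sketch using the built-in hex() and format() functions. The variable names are my own choices, not anything from the text.

```python
code = ord('a')                  # the ASCII number for 'a'
as_hex = hex(code)               # the same number written in hex
as_bits = format(code, '07b')    # the same number as 7 binary digits
back = chr(0x61)                 # python understands hex numbers directly

print(code)      # 97
print(as_hex)    # 0x61
print(as_bits)   # 1100001
print(back)      # a
```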

These days most software understands what you can think of as an extension of ASCII called utf8, which uses longer numbers to encode most symbols ever written in any human language ... and a lot of others too. See for example wikipedia: UTF_8 for the details.

Python3 strings are made up of unicode characters, which utf8 turns into bytes. As long as the string is made up of only ASCII characters, its length in bytes and length in characters is the same. But non-ASCII characters take more than one byte each in utf8, so the length of the string may not be the same as the number of bytes it takes to represent it, or what it takes to put it into a file.

(This is one of the big differences between python2 and python3; in python2, strings were made up of bytes.)

In this course, we will usually just stick to ASCII.

But just to show you the difference ...

>>> s = "Δ and ๗ or 葉"
>>> print(s)
Δ and ๗ or 葉
>>> len(s)           # 12 utf-8 characters
12
>>> s[0]             # the 0th character
'Δ'
>>> b = s.encode()   # convert to bytes
>>> b                # \x94 means the hex byte 0x94 
b'\xce\x94 and \xe0\xb9\x97 or \xe8\x91\x89'
>>> type(b)
<class 'bytes'>
>>> len(b)           # 17 bytes of data
17
>>> b[0]
206
>>> b[3]
97
>>> chr(b[3])
'a'

operators

As you all know, + and * are operators that can work with numbers to produce other numbers.

>>> 1 + 2
3
>>> 3 * 2
6

In python (and many other languages), operators are also used on other types of data to produce various things. The details depend on the language. Here's how + and * work in python on strings and lists.

>>> "one" + "two"
'onetwo'
>>> [1,2,3] + [4,5,6]
[1, 2, 3, 4, 5, 6]

The addition operator, +, is used to stick two strings or lists together, making a new one with all the elements from the old ones. The old lists or strings are left unchanged.

Since multiplication is repeated addition, the authors of the language decided to let multiplication stick together many copies.

>>> "one" * 3
'oneoneone'
>>> [1,2,3] * 4
[1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3]

Trying to add a string or a list to a different type gives an error.

The subtraction operator, -, is not defined for either of these types, so that gives an error too.

>>> "hello" + [1,2,3]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: can only concatenate str (not "list") to str

>>> [1,2,3] + 4
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: can only concatenate list (not "int") to list

>>> [1,2,3] + [4]   # stick together two lists ... this works.
[1, 2, 3, 4]

>>> [1,2,3] - [3]   # ... but this isn't defined, so it gives an error.
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unsupported operand type(s) for -: 'list' and 'list'

Subtraction is usually used only for numbers. Though there is one case in python where it is used for something else. It's a bit outside the scope of what's in this chapter, but just for fun ... python has sets (math sets - no order, no repeated elements) which can be made with {}. And subtracting sets removes the elements of the second from the first.

>>> {1,2,3,4} - {3,2}  # from one set, remove some elements
{1, 4}
>>> {1,2,2,2,2}        # sets can't contain repeated elements
{1, 2}

accumulator pattern - revisited

We've already seen how to add up numbers using this technique :

numbers = [20, 30, 50, 1]
total = 0
for n in numbers:
    total = total + n
print("The total is " + str(total))

This same approach can be used with lists and strings to build up new lists and strings, one element at a time.

word = "factory"
backwards = ""                  # start with an empty string
for w in word:                  # for each char 'f', 'a', ...
    backwards = w + backwards   # put it at beginning
print(word + " backwards is " + backwards)
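
And here is a sketch of the same pattern building up a list rather than a string. The name doubled is just something I made up for this example.

```python
numbers = [20, 30, 50, 1]
doubled = []                  # start with an empty list
for n in numbers:             # for each number 20, 30, ...
    doubled.append(2 * n)     # put twice that number at the end
print(doubled)                # [40, 60, 100, 2]
```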

working with files

We commonly want to read or write to files. In python, the steps look something like this.

file = open('some_file.txt', 'r')  # get read access 
text = file.read()                 # grab the data
file.close()                       # done with the file

In the first line, we create a "file object" which has methods that let us manipulate the file. There are several variations including .readline() (get the next line of text) and .readlines() (get all lines as a list).

Writing to a file is similar.

file = open('some_file.txt', 'w')  # get write access 
file.write("Hi Mom!")              # put in the data
file.close()                       # make sure the data is saved

Another variation is to open the file with 'a' (append) rather than 'w', so that .write(text) adds data to the end of the file instead of replacing what was there.

In both cases, the file object keeps track of a current file position, so that reading or writing more than once continues from where you left off.
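
Here is a sketch that puts those pieces together : writing twice, adding to the end by opening with mode 'a', and then reading in two steps. The tempfile folder is my own addition, there only so that running this doesn't overwrite any file of yours.

```python
import os
import tempfile

# A throwaway folder (an assumption of this sketch, not part of the notes).
folder = tempfile.mkdtemp()
path = os.path.join(folder, 'some_file.txt')

file = open(path, 'w')     # 'w' : start a new, empty file
file.write("Hi Mom!\n")
file.write("Hi Dad!\n")    # continues from where the last write ended
file.close()               # make sure the data is saved

file = open(path, 'a')     # 'a' : add to the end of what is already there
file.write("Bye!\n")
file.close()

file = open(path, 'r')
first = file.readline()    # just the first line
rest = file.read()         # everything after the current position
file.close()

print(repr(first))         # 'Hi Mom!\n'
print(repr(rest))          # 'Hi Dad!\nBye!\n'
```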

The reading and writing files python documentation gives the details.

It's worth mentioning lines, line endings, and the mostly invisible characters that are used to mark the end of the line. It turns out that historically, different computer makers adopted different conventions.

Teletypes, old mechanical typewriter-like things that used to be what computers were connected to, were often sent two characters at the end of the line : CR (carriage return), to send the head from the end of the line back to the beginning, and LF (line feed) to turn the paper feed down to the next empty line.

See this wikipedia: newline article for all the gory details.

These days, lines end with a special "newline" character which in python and many other languages is denoted "\n". That's one character, not two; the backslash often has special meaning in strings. Similarly, "\t" is the tab character.
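
If you'd like to convince yourself that these really are single characters, here is a quick sketch. The string "John\n" is just an example line.

```python
line = "John\n"
print(len(line))               # 5 : four letters plus one newline
print(len("\n"), len("\t"))    # 1 1 : each is one character
print(ord("\n"), ord("\t"))    # 10 9 : their ASCII numbers
print("a\tb\nc")               # a tab between a and b, then c on a new line
```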

When reading and writing from files, you need to take care to handle these "whitespace" characters correctly. There are in fact special string methods such as string.strip() to get rid of them.

For example, suppose you have a file

------ names.txt -----
John
Mary
Santa
----------------------

Then in python, reading this could give

>>> f = open('names.txt', 'r')
>>> f.read()
'John\nMary\nSanta\n'

The '\n' are the newline characters at the end of each line.

If we wanted to read in the names to a list, then we could try

>>> f = open('names.txt')   # 'r' is the default
>>> lines = f.readlines()   # slurp all the lines into a list
>>> names = []              # initialize an empty list
>>> for line in lines:      # loop over each line in the file
...    names.append(line)   # add each line to the list of names
>>> print(names)
['John\n', 'Mary\n', 'Santa\n']

This is close, but those newline characters are still there at the end of each line. To get rid of them, the simplest approach is to use the .strip() method which removes whitespace (spaces, tabs, newlines) from the start and end of the string.

Here's one that works.

>>> file = open('names.txt')
>>> names = []              
>>> for line in file.readlines():      
...    names.append(line.strip())  # this time, remove newlines
>>> print(names)
['John', 'Mary', 'Santa']

(That is the "accumulator pattern" again : initialize something, then put more stuff into it in a loop.)
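
If you want something you can run as is, here is a sketch that first creates names.txt in a throwaway folder and then reads it back. The tempfile folder and the "with" statement are my own additions : "with" is a common python idiom that closes the file for you when the indented block ends.

```python
import os
import tempfile

# Make the example file in a temporary folder (an assumption of this sketch).
folder = tempfile.mkdtemp()
path = os.path.join(folder, 'names.txt')
file = open(path, 'w')
file.write("John\nMary\nSanta\n")
file.close()

names = []                           # initialize an empty list
with open(path) as file:             # closed automatically afterwards
    for line in file:                # a file object can be looped over by line
        names.append(line.strip())   # remove the newline, then accumulate
print(names)                         # ['John', 'Mary', 'Santa']
```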


Whew! That was a lot of stuff.

I'm sure you'll have questions ... come find me in a zoomy session as you work through the coding exercises.

complications

I suggest you look through the text and slides and practice a bit with lists and strings before studying the rest of these notes - what follows are complexities that are mostly python specific and which you don't need to master right away.

"in place" vs "return new object"

Some of the list and string methods modify objects in place, while others return a new object. Python does not do a good job of making the distinction between these two different things clear, and that can lead to errors.

Here are two examples to show you what I'm talking about.

First, to create a list and then add something to it, the syntax is :

>>> stuff = [10, "red", 3.2]  # a list made up of an integer, a string, and a float
>>> stuff.append(103)         # this changes stuff. Often called an "in place" modification.
>>> print(stuff)
[10, 'red', 3.2, 103]

This is possible because lists are "mutable" data structures, which means they can be changed.

Which data types are mutable and which aren't depends on the programming language - some are designed with different choices in mind. Sometimes we need things that change, for example a file that we edit, changing it from the old to the new version. But if some information is shared by several different programs, if one program changes it then the others may not learn of the change, and things can get complicated. To avoid the complications, some data types are not changeable - they are "immutable".

On the other hand, strings in python are "immutable". You can make a new string, but you can't modify one.

>>> word = 'cat'
>>> new_word = word.upper()   # produce a new, upper case word
>>> print(word)     # unchanged by .upper()
cat
>>> print(new_word)
CAT

So for lists, the .append() method modifies that list "in place".

But for strings, the .upper() method produces ("returns") a new string, leaving the original one as it was.

Some programming languages have syntax which makes the difference between these clear. Python is not one of those - you just have to know which is which (as described in the documentation).
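
Here is a sketch of the two mistakes that this distinction most often leads to : saving what an "in place" method gives back, and forgetting to save what a "return new object" method gives back. The name result is just for illustration.

```python
stuff = [10, "red", 3.2]
result = stuff.append(103)   # .append() changes stuff itself ...
print(result)                # None  ... and gives back nothing useful
print(stuff)                 # [10, 'red', 3.2, 103]

word = 'cat'
word.upper()                 # makes 'CAT', which is then thrown away
print(word)                  # cat
word = word.upper()          # to keep the new string, assign it to a name
print(word)                  # CAT
```

So writing stuff = stuff.append(103) would replace your list with None, which is almost never what you want.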

tuples

Python does have an immutable sequence type ... it's called the "tuple".

>>> a = (1,2,3)
>>> type(a)
<class 'tuple'>
>>> b = [100, 200, 300]
>>> type(b)
<class 'list'>

Tuples and lists are very similar ... except that one can be changed, and one can't.

To access the interior parts of a tuple, you still use [], just like strings and lists.

>>> print(a[0])   # 0'th element of the tuple
1
>>> print(b[0])   # 0'th element of the list
100

The difference is that you can change what's in the list, for example with an assignment statement. But trying to do that with a tuple gives an error.

>>> b[0] = 7   # change the 0'th element of the list named b
>>> print(b)   # show it
[7, 200, 300]
>>> a[0] = 7   # try to change the 0'th element of a tuple
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'tuple' object does not support item assignment
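
Just as with strings, you can't modify a tuple but you can make a new one. Here is a sketch of two ways to do that; the names bigger, as_list, and changed are my own.

```python
a = (1, 2, 3)

bigger = a + (4,)          # + makes a new tuple; (4,) is a one element tuple

as_list = list(a)          # copy the elements into a (mutable) list
as_list[0] = 7             # change the copy
changed = tuple(as_list)   # and turn it back into a tuple

print(a)         # (1, 2, 3)   the original is untouched
print(bigger)    # (1, 2, 3, 4)
print(changed)   # (7, 2, 3)
```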

iterators

Another source of confusion is that python has data types that sometimes act like lists but aren't really. An example is the range() that we've been using to loop over things.

Looping over [0,1,2] and range(3) does the same thing:

>>> for i in [0,1,2]:
...     print(i)
0
1
2
>>> for i in range(3):
...     print(i)
0
1
2

But if we use type() to see what the data types are we get

>>> type([1,2,3])
<class 'list'>
>>> type(range(10))
<class 'range'>

And trying to look at range(3) isn't very helpful.

>>> print(range(3))
range(0, 3)

The reason for this complexity is that range() is often used for looping, and for a big loop like range(1000000) it isn't efficient to create all the numbers from 0 to a million and store them (which is what a list would have) before running the loop. Instead, for i in range(1000000) just sets i to each number in turn, without using a lot of memory.

In other words, range() only produces those numbers when it needs to. This sort of computing approach is called "lazy", and it can be a powerful idea. But sometimes it just makes life more complicated.
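
One rough way to see the difference is with sys.getsizeof(), which reports how many bytes an object itself takes up. This is only a sketch : the exact numbers depend on your python version, and for the list it doesn't even count the memory for the numbers themselves.

```python
import sys

lazy = range(1000000)          # just remembers start, stop, and step
stored = list(lazy)            # actually creates and stores a million numbers

print(sys.getsizeof(lazy))     # small, a few dozen bytes
print(sys.getsizeof(stored))   # big, millions of bytes
print(len(lazy), len(stored))  # 1000000 1000000
```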

If you really do want a list, you can convert a lazy object like range with the list() function.

>>> list(range(10))
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

For one more example, consider the reversed built-in function which creates a new object from a list, with the elements in the other order.

>>> numbers = [10, 20, 30, 40]
>>> other_way = reversed(numbers)
>>> print(other_way)
<list_reverseiterator object at 0x7fd788132940>

I expected [40, 30, 20, 10] ... but what we got was a "lazy" object which hasn't yet actually done anything. We could loop over it, or convert it to a list. But expecting it to actually be a list can lead to errors.

>>> print(other_way[2])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'list_reverseiterator' object is not subscriptable
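
Here is a sketch of converting that lazy object to a list, along with one more surprise : once you've gone through it, it is used up. The names first_pass and second_pass are just for illustration.

```python
numbers = [10, 20, 30, 40]
other_way = reversed(numbers)   # lazy : nothing has been reversed yet

first_pass = list(other_way)    # now the elements are actually produced
second_pass = list(other_way)   # nothing left the second time around

print(first_pass)    # [40, 30, 20, 10]
print(second_pass)   # []
print(numbers)       # [10, 20, 30, 40]   the original list is unchanged
```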

Note that there is one other way to reverse a list.

>>> numbers.reverse()
>>> print(numbers)
[40, 30, 20, 10]

The .reverse() method is one of those that acts "in place", modifying numbers itself. It does not give anything back.