The Internet Archive Python Library¶

Welcome to the documentation for the internetarchive Python library. internetarchive is a command-line and Python interface to archive.org. Please report any issues on GitHub.

If you’re not sure where to begin, the quickest and easiest way to get started is downloading a binary and taking a look at the command-line interface documentation.

User’s Guide¶

Installation¶

System-Wide Installation¶

Installing the internetarchive library globally on your system can be done with pip. This is the recommended method for installing internetarchive (see below for details on installing pip):

$ sudo pip install internetarchive

or, with easy_install:

$ sudo easy_install internetarchive

Either of these commands will install the internetarchive Python library and ia command-line tool on your system.

Note: Some versions of Mac OS X come with Python libraries that are required by internetarchive (e.g. the Python package six). This can cause installation issues. If your installation is failing with a message that looks something like:

OSError: [Errno 1] Operation not permitted: '/var/folders/bk/3wx7qs8d0x79tqbmcdmsk1040000gp/T/pip-TGyjVo-uninstall/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/six-1.4.1-py2.7.egg-info'

You can use the --ignore-installed parameter in pip to ignore the libraries that are already installed, and continue with the rest of the installation:

$ sudo pip install --ignore-installed internetarchive

More details on this issue can be found here: https://github.com/pypa/pip/issues/3165

Installing Pip¶

The easiest way to install pip is probably using your operating system’s package manager.

Mac OS, with homebrew:

$ brew install pip

Ubuntu, with apt-get:

$ sudo apt-get install python-pip

If your OS doesn’t have a package manager, you can also install pip with get-pip.py:

$ curl -LOs https://bootstrap.pypa.io/get-pip.py
$ python get-pip.py

virtualenv¶

If you don’t want to, or can’t, install the package system-wide you can use virtualenv to create an isolated Python environment.

First, make sure virtualenv is installed on your system. If it’s not, you can do so with pip:

$ sudo pip install virtualenv

With easy_install:

$ sudo easy_install virtualenv

Or with your system’s package manager, apt-get for example:

$ sudo apt-get install python-virtualenv

Once you have virtualenv installed on your system, create a virtualenv:

$ mkdir myproject
$ cd myproject
$ virtualenv venv
New python executable in venv/bin/python
Installing setuptools, pip............done.

Activate your virtualenv:

$ . venv/bin/activate

Install internetarchive into your virtualenv:

$ pip install internetarchive

Snap¶

You can install the latest ia snap, and help test the most recent changes on the master branch, on any supported Linux distro with:

$ sudo snap install ia --edge

Every time a new version of ia is pushed to the store, you will get it updated automatically.

Binaries¶

Binaries are also available for the ia command-line tool:

$ curl -LOs https://archive.org/download/ia-pex/ia
$ chmod +x ia

Binaries are generated with PEX. The only requirement for using the binaries is that you have Python installed on a Unix-like operating system.

For more details on the command-line interface please refer to the README, or ia help.

Get the Code¶

Internetarchive is actively developed on GitHub.

You can either clone the public repository:

$ git clone https://github.com/jjjake/internetarchive.git

Download the tarball:

$ curl -OL https://github.com/jjjake/internetarchive/tarball/master

Or, download the zipball:

$ curl -OL https://github.com/jjjake/internetarchive/zipball/master

Once you have a copy of the source, you can install it into your site-packages easily:

$ python setup.py install

Quickstart¶

Configuring¶

Certain functionality of the internetarchive Python library requires your archive.org credentials. Your IA-S3 keys are required for uploading, searching, and modifying metadata, and your archive.org logged-in cookies are required for downloading access-restricted content and viewing your task history. To automatically create a config file with your archive.org credentials, you can use the ia command-line tool:

$ ia configure
Enter your archive.org credentials below to configure 'ia'.

Email address: user@example.com
Password:

Config saved to: /home/user/.config/ia.ini

Your config file will be saved to $HOME/.config/ia.ini, or $HOME/.ia if you do not have a .config directory in $HOME. Alternatively, you can specify your own path to save the config to via ia --config-file '~/.ia-custom-config' configure.

If you have a netrc file with your archive.org credentials in it, you can simply run ia configure --netrc. Note that Python’s netrc library does not currently support passphrases or passwords containing spaces, so such credentials are not currently supported here.
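
The default save location described above can be sketched in a few lines of Python. This is only an illustration of the documented rule, not the library's own code; default_config_path is a hypothetical helper name, and the temporary directories stand in for $HOME.

```python
import os
import tempfile


def default_config_path(home):
    """Return the documented default config location for a given home directory.

    Hypothetical helper: it mirrors the rule stated in the docs
    ($HOME/.config/ia.ini, or $HOME/.ia when $HOME has no .config directory).
    """
    config_dir = os.path.join(home, ".config")
    if os.path.isdir(config_dir):
        return os.path.join(config_dir, "ia.ini")
    return os.path.join(home, ".ia")


# Case 1: a home directory without a .config directory.
with tempfile.TemporaryDirectory() as home_without_config:
    path_without = default_config_path(home_without_config)
    print(os.path.basename(path_without))

# Case 2: a home directory that already has a .config directory.
with tempfile.TemporaryDirectory() as home_with_config:
    os.mkdir(os.path.join(home_with_config, ".config"))
    path_with = default_config_path(home_with_config)
    print(os.path.relpath(path_with, home_with_config))
```

The first call resolves to a file named .ia, the second to ia.ini inside .config.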

Uploading¶

Creating a new item on archive.org and uploading files to it is as easy as:

>>> from internetarchive import upload
>>> md = dict(collection='test_collection', title='My New Item', mediatype='movies')
>>> r = upload('<identifier>', files=['foo.txt', 'bar.mov'], metadata=md)
>>> r[0].status_code
200

You can set the remote filename using a dictionary:

>>> r = upload('<identifier>', files={'remote-name.txt': 'local-name.txt'})

You can upload file-like objects:

>>> from io import StringIO
>>> r = upload('iacli-test-item301', {'foo.txt': StringIO(u'bar baz boo')})

If the item already has a file with the same filename, the existing file within the item will be overwritten.

upload can also upload directories. For example, the following command will upload my_dir and all of its contents to https://archive.org/download/my_item/my_dir/:

>>> r = upload('my_item', 'my_dir')

To upload only the contents of the directory, but not the directory itself, simply append a slash to your directory:

>>> r = upload('my_item', 'my_dir/')

This will upload all of the contents of my_dir to https://archive.org/download/my_item/. upload accepts relative or absolute paths.
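
To make the trailing-slash rule concrete, here is a minimal local sketch that computes the names files would end up with inside the item. It does not call upload or touch archive.org; remote_names is a hypothetical helper written only to illustrate the rule described above.

```python
import os
import tempfile


def remote_names(local_dir):
    """List the in-item names for files under local_dir.

    Hypothetical helper: without a trailing slash the directory name is
    kept as a prefix, with a trailing slash only the contents are used.
    """
    keep_dir_name = not local_dir.endswith("/")
    root = local_dir.rstrip("/")
    prefix = os.path.basename(root) + "/" if keep_dir_name else ""
    names = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            full_path = os.path.join(dirpath, filename)
            relative = os.path.relpath(full_path, root).replace(os.sep, "/")
            names.append(prefix + relative)
    return sorted(names)


# Build a small throwaway directory tree to run the helper against.
with tempfile.TemporaryDirectory() as tmp:
    my_dir = os.path.join(tmp, "my_dir")
    os.makedirs(os.path.join(my_dir, "sub"))
    for name in ("a.txt", os.path.join("sub", "b.txt")):
        with open(os.path.join(my_dir, name), "w") as handle:
            handle.write("example")

    with_dir = remote_names(my_dir)
    contents_only = remote_names(my_dir + "/")

print(with_dir)
print(contents_only)
```

The first list keeps the my_dir/ prefix, matching upload('my_item', 'my_dir'); the second drops it, matching upload('my_item', 'my_dir/').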

Note: metadata can only be added to an item using the upload function on item creation. If an item already exists and you would like to modify its metadata, you must use modify_metadata.

Metadata¶

Reading Metadata¶

You can access all of an item’s metadata via the Item object:

>>> from internetarchive import get_item
>>> item = get_item('iacli-test-item301')
>>> item.item_metadata['metadata']['title']
'My Title'

get_item retrieves all of an item’s metadata via the Internet Archive Metadata API. This metadata can be accessed via the Item.item_metadata attribute:

>>> item.item_metadata.keys()
dict_keys(['created', 'updated', 'd2', 'uniq', 'metadata', 'item_size', 'dir', 'd1', 'files', 'server', 'files_count', 'workable_servers'])

All of the top-level keys in item.item_metadata are available as attributes:

>>> item.server
'ia801507.us.archive.org'
>>> item.item_size
161752024
>>> item.files[0]['name']
'blank.txt'
>>> item.metadata['identifier']
'iacli-test-item301'

Writing Metadata¶

Adding new metadata to an item can be done using the modify_metadata function:

>>> from internetarchive import modify_metadata
>>> r = modify_metadata('<identifier>', metadata=dict(title='My Stuff'))
>>> r.status_code
200

Modifying metadata can also be done via the Item object. For example, changing the title we set in the example above can be done like so:

>>> r = item.modify_metadata(dict(title='My New Title'))
>>> item.metadata['title']
'My New Title'

To remove a metadata field from an item’s metadata, set the value to 'REMOVE_TAG':

>>> r = item.modify_metadata(dict(foo='new metadata field.'))
>>> item.metadata['foo']
'new metadata field.'
>>> r = item.modify_metadata(dict(foo='REMOVE_TAG'))
>>> print(item.metadata.get('foo'))
None

The default behaviour of modify_metadata is to modify item-level metadata (i.e. title, description, etc.). If we want to modify different kinds of metadata, say the metadata of a specific file, we have to change the metadata target in the call to modify_metadata:

>>> r = item.modify_metadata(dict(title='My File Title'), target='files/foo.txt')
>>> f = item.get_file('foo.txt')
>>> f.title
'My File Title'

Refer to Internet Archive Metadata for more specific details regarding metadata and archive.org.

Downloading¶

Downloading files can be done via the download function:

>>> from internetarchive import download
>>> download('nasa', verbose=True)
nasa:
 downloaded nasa/globe_west_540.jpg to nasa/globe_west_540.jpg
 downloaded nasa/NASAarchiveLogo.jpg to nasa/NASAarchiveLogo.jpg
 downloaded nasa/globe_west_540_thumb.jpg to nasa/globe_west_540_thumb.jpg
 downloaded nasa/nasa_reviews.xml to nasa/nasa_reviews.xml
 downloaded nasa/nasa_meta.xml to nasa/nasa_meta.xml
 downloaded nasa/nasa_archive.torrent to nasa/nasa_archive.torrent
 downloaded nasa/nasa_files.xml to nasa/nasa_files.xml

By default, the download function sets the mtime for downloaded files to the mtime of the file on archive.org. If we retry downloading the same set of files we downloaded above, no requests will be made. This is because the filename, mtime and size of the local files match the filename, mtime and size of the files on archive.org, so we assume that the file has already been downloaded. For example:

>>> download('nasa', verbose=True)
nasa:
 skipping nasa/globe_west_540.jpg, file already exists based on length and date.
 skipping nasa/NASAarchiveLogo.jpg, file already exists based on length and date.
 skipping nasa/globe_west_540_thumb.jpg, file already exists based on length and date.
 skipping nasa/nasa_reviews.xml, file already exists based on length and date.
 skipping nasa/nasa_meta.xml, file already exists based on length and date.
 skipping nasa/nasa_archive.torrent, file already exists based on length and date.
 skipping nasa/nasa_files.xml, file already exists based on length and date.
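
The skip decision described above boils down to comparing a local file's size and mtime with the values recorded on archive.org. The sketch below shows that comparison on a throwaway file; looks_downloaded is a hypothetical helper and the size and timestamp are made-up values, so treat it as an illustration of the idea and not as the library's implementation.

```python
import os
import tempfile


def looks_downloaded(local_path, remote_size, remote_mtime):
    """Return True if the local file exists and matches the remote size and mtime."""
    if not os.path.exists(local_path):
        return False
    stat = os.stat(local_path)
    return stat.st_size == remote_size and int(stat.st_mtime) == remote_mtime


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "nasa_meta.xml")
    with open(path, "wb") as handle:
        handle.write(b"<metadata/>")  # 11 bytes
    # Pretend the remote copy was last modified at this Unix timestamp,
    # and set the local mtime to match, as a finished download would.
    remote_mtime = 1200000000
    os.utime(path, (remote_mtime, remote_mtime))

    same = looks_downloaded(path, remote_size=11, remote_mtime=remote_mtime)
    changed = looks_downloaded(path, remote_size=12, remote_mtime=remote_mtime)
    missing = looks_downloaded(path + ".missing", 11, remote_mtime)

print(same, changed, missing)
```

Only the first check passes; a size mismatch or a missing file means the file would be fetched again.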

Alternatively, you can skip files based on md5 checksums. This will take longer because checksums need to be calculated for every file already downloaded, but it is safer:

>>> download('nasa', verbose=True, checksum=True)
nasa:
 skipping nasa/globe_west_540.jpg, file already exists based on checksum.
 skipping nasa/NASAarchiveLogo.jpg, file already exists based on checksum.
 skipping nasa/globe_west_540_thumb.jpg, file already exists based on checksum.
 skipping nasa/nasa_reviews.xml, file already exists based on checksum.
 skipping nasa/nasa_meta.xml, file already exists based on checksum.
 skipping nasa/nasa_archive.torrent, file already exists based on checksum.
 skipping nasa/nasa_files.xml, file already exists based on length and date.
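
A checksum comparison of this kind can be sketched with hashlib. The helpers md5_of_file and matches_checksum are hypothetical names used for illustration; the expected digest would normally come from the item's file metadata, but here it is computed locally so the example runs without network access.

```python
import hashlib
import os
import tempfile


def md5_of_file(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large files do not have to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected_md5):
    """Return True if the local file exists and has the expected md5."""
    return os.path.exists(path) and md5_of_file(path) == expected_md5


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "foo.txt")
    with open(path, "wb") as handle:
        handle.write(b"bar baz boo")
    # Stand-in for the md5 that would be recorded for the remote file.
    expected = hashlib.md5(b"bar baz boo").hexdigest()
    unchanged = matches_checksum(path, expected)

    # Corrupt the local copy and check again.
    with open(path, "ab") as handle:
        handle.write(b"!")
    modified = matches_checksum(path, expected)

print(unchanged, modified)
```

Unlike the size and mtime check, this catches a file whose contents changed, at the cost of reading every byte.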

By default, the download function will download all of the files in an item. However, there are a couple of parameters that can be used to download only specific files. Files can be filtered using the glob_pattern parameter:

>>> download('nasa', verbose=True, glob_pattern='*xml')
nasa:
 downloaded nasa/nasa_reviews.xml to nasa/nasa_reviews.xml
 downloaded nasa/nasa_meta.xml to nasa/nasa_meta.xml
 downloaded nasa/nasa_files.xml to nasa/nasa_files.xml
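
If you want to preview which filenames a pattern would select before downloading, shell-style matching from the standard library gives a reasonable approximation. This sketch assumes glob_pattern behaves like fnmatch-style matching on the filename, which is an assumption on our part; the file list is copied from the example output above.

```python
from fnmatch import fnmatchcase

# Filenames taken from the `nasa` item listing shown earlier.
names = [
    "globe_west_540.jpg",
    "NASAarchiveLogo.jpg",
    "globe_west_540_thumb.jpg",
    "nasa_reviews.xml",
    "nasa_meta.xml",
    "nasa_archive.torrent",
    "nasa_files.xml",
]


def filter_by_glob(filenames, pattern):
    """Keep only the filenames that match a shell-style pattern."""
    return [name for name in filenames if fnmatchcase(name, pattern)]


xml_files = filter_by_glob(names, "*xml")
print(xml_files)
```

The result is the same three XML files that the download example above fetched.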

Files can also be filtered using the formats parameter. formats can either be a single format provided as a string:

>>> download('goodytwoshoes00newyiala', verbose=True, formats='MARC')
goodytwoshoes00newyiala:
 downloaded goodytwoshoes00newyiala/goodytwoshoes00newyiala_meta.mrc to goodytwoshoes00newyiala/goodytwoshoes00newyiala_meta.mrc

Or, a list of formats:

>>> download('goodytwoshoes00newyiala', verbose=True, formats=['DjVuTXT', 'MARC'])
goodytwoshoes00newyiala:
 downloaded goodytwoshoes00newyiala/goodytwoshoes00newyiala_meta.mrc to goodytwoshoes00newyiala/goodytwoshoes00newyiala_meta.mrc
 downloaded goodytwoshoes00newyiala/goodytwoshoes00newyiala_djvu.txt to goodytwoshoes00newyiala/goodytwoshoes00newyiala_djvu.txt

Downloading On-The-Fly Files¶

Some files on archive.org are generated on-the-fly as requested. This currently includes non-original files of the formats EPUB, MOBI, DAISY, and archive.org’s own MARC XML. These files can be downloaded using the on_the_fly parameter:

>>> download('goodytwoshoes00newyiala', verbose=True, formats='EPUB', on_the_fly=True)
goodytwoshoes00newyiala:
 downloaded goodytwoshoes00newyiala/goodytwoshoes00newyiala.epub to goodytwoshoes00newyiala/goodytwoshoes00newyiala.epub

Searching¶

The search_items function can be used to iterate through archive.org search results:

>>> from internetarchive import search_items
>>> for i in search_items('identifier:nasa'):
...     print(i['identifier'])
...
nasa

search_items can also yield Item objects:

>>> from internetarchive import search_items
>>> for item in search_items('identifier:nasa').iter_as_items():
...     print(item)
...
Collection(identifier='nasa', exists=True)

search_items will automatically paginate through large result sets.

Command-Line Interface¶

The ia command-line tool is installed with internetarchive, or available as a binary. ia allows you to interact with various archive.org services from the command-line.

Getting Started¶

The easiest way to start using ia is downloading a binary. The only requirements of the binary are a Unix-like environment with Python installed. To download the latest binary and make it executable, simply run:

$ curl -LOs https://archive.org/download/ia-pex/ia
$ chmod +x ia
$ ./ia help
A command line interface to archive.org.

usage:
    ia [--help | --version]
    ia [--config-file FILE] [--log | --debug] [--insecure] <command> [<args>]...

options:
    -h, --help
    -v, --version
    -c, --config-file FILE  Use FILE as config file.
    -l, --log               Turn on logging [default: False].
    -d, --debug             Turn on verbose logging [default: False].
    -i, --insecure          Use HTTP for all requests instead of HTTPS [default: false]

commands:
    help      Retrieve help for subcommands.
    configure Configure `ia`.
    metadata  Retrieve and modify metadata for items on archive.org.
    upload    Upload items to archive.org.
    download  Download files from archive.org.
    delete    Delete files from archive.org.
    search    Search archive.org.
    tasks     Retrieve information about your archive.org catalog tasks.
    list      List files in a given item.

See 'ia help <command>' for more information on a specific command.

Metadata¶

Reading Metadata¶

You can use ia to read and write metadata from archive.org. To retrieve all of an item’s metadata in JSON, simply:

$ ia metadata TripDown1905

A particularly useful tool to use alongside ia is jq. jq is a command-line tool for parsing JSON. For example:

$ ia metadata TripDown1905 | jq '.metadata.date'
"1906"

Modifying Metadata¶

Once ia has been configured, you can modify metadata:

$ ia metadata <identifier> --modify="foo:bar" --modify="baz:foooo"

You can remove a metadata field by setting the value of the given field to REMOVE_TAG. For example, to remove the metadata field foo from the item <identifier>:

$ ia metadata <identifier> --modify="foo:REMOVE_TAG"

Note that some metadata fields (e.g. mediatype) cannot be modified, and must instead be set initially on upload.

The default target to write to is metadata. If you would like to write to another target, such as files, you can specify so using the --target parameter. For example, if we had an item whose identifier was my_identifier and we wanted to add a metadata field to a file within the item called foo.txt:

$ ia metadata my_identifier --target="files/foo.txt" --modify="title:My File"

You can also create new targets if they don’t exist:

$ ia metadata <identifier> --target="extra_metadata" --modify="foo:bar"

There is also an --append option, which allows you to append a string to an existing metadata string (note: use --append-list to append elements to a list). For example, if your item’s title was Foo and you wanted it to be Foo Bar, you could simply do:

$ ia metadata <identifier> --append="title: Bar"

If you would like to add a new value to an existing field that is an array (like subject or collection), you can use the --append-list option:

$ ia metadata <identifier> --append-list="subject:another subject"

This command would append another subject to the item’s list of subjects, if it doesn’t already exist (i.e. no duplicate elements are added).

Metadata fields or elements can be removed with the --remove option:

$ ia metadata <identifier> --remove="subject:another subject"

This would remove another subject from the item’s subject field, regardless of whether the field is single-valued or multi-valued.
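
The difference between --append, --append-list and --remove is easiest to see on a plain dictionary. The sketch below models the documented outcomes locally; the three functions are hypothetical helpers, not part of internetarchive, and dropping a field once its last value is removed is an assumption made for the sake of the example.

```python
def append_string(metadata, key, text):
    """Model --append: add text to the end of an existing string value."""
    metadata[key] = metadata.get(key, "") + text


def append_list(metadata, key, value):
    """Model --append-list: add a value to a list field, skipping duplicates."""
    current = metadata.get(key, [])
    if not isinstance(current, list):
        current = [current]
    if value not in current:
        current = current + [value]
    metadata[key] = current


def remove_value(metadata, key, value):
    """Model --remove: drop a value from a single or multi-value field."""
    current = metadata.get(key)
    if isinstance(current, list):
        remaining = [item for item in current if item != value]
        if remaining:
            metadata[key] = remaining
        else:
            # Assumption: an emptied field is dropped entirely.
            metadata.pop(key)
    elif current == value:
        metadata.pop(key)


item = {"title": "Foo", "subject": ["first subject"]}

append_string(item, "title", " Bar")
append_list(item, "subject", "another subject")
append_list(item, "subject", "another subject")  # duplicate, ignored
after_append = {"title": item["title"], "subject": list(item["subject"])}

remove_value(item, "subject", "another subject")
after_remove = {"title": item["title"], "subject": list(item["subject"])}

print(after_append)
print(after_remove)
```

After the appends the title is Foo Bar and the subject list has two entries; after the remove the list is back to its single original entry.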

Refer to Internet Archive Metadata for more specific details regarding metadata and archive.org.

Modifying Metadata in Bulk¶

If you have a lot of metadata changes to submit, you can use a CSV spreadsheet to submit many changes with a single command. Your CSV must contain an identifier column, with one item per row. Any other column added will be treated as a metadata field to modify. If no value is provided in a given row for a column, no changes will be submitted. If you would like to specify multiple values for certain fields, an index can be provided: subject[0], subject[1]. Your CSV file should be UTF-8 encoded. See metadata.csv for an example CSV file.

Once you’re ready to submit your changes, you can submit them like so:

$ ia metadata --spreadsheet=metadata.csv

See ia help metadata for more details.
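
If your changes live in a script and not in a spreadsheet application, the CSV layout described above can be generated with Python's csv module. The identifiers and values below are placeholders; the column names follow the rules stated earlier (an identifier column, one column per field, indexed columns for multiple values, empty cells for no change).

```python
import csv
import io

# Placeholder rows: one item per row, an empty cell means "no change".
rows = [
    {"identifier": "my_item_one", "title": "First Title",
     "subject[0]": "cats", "subject[1]": "dogs"},
    {"identifier": "my_item_two", "title": "",
     "subject[0]": "birds", "subject[1]": ""},
]
fieldnames = ["identifier", "title", "subject[0]", "subject[1]"]

# Write to an in-memory buffer so the example has no side effects.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()

print(csv_text)
```

To use the result with ia metadata --spreadsheet, write csv_text to a file opened with encoding="utf-8".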

Upload¶

ia can also be used to upload items to archive.org. After configuring ia, you can upload files like so:

$ ia upload <identifier> file1 file2 --metadata="mediatype:texts" --metadata="blah:arg"

Please note that, unless specified otherwise, items will be uploaded with a data mediatype. This cannot be changed afterwards. Therefore, you should specify a mediatype when uploading, e.g. --metadata="mediatype:movies".

You can upload files from stdin:

$ curl http://dumps.wikimedia.org/kywiki/20130927/kywiki-20130927-pages-logging.xml.gz \
  | ia upload <identifier> - --remote-name=kywiki-20130927-pages-logging.xml.gz --metadata="title:Uploaded from stdin."

You can use the --retries parameter to retry on errors (e.g. if IA-S3 is overloaded):

$ ia upload <identifier> file1 --retries 10

Note that ia upload makes a backup of any files that are clobbered. They are saved to a directory in the item named history/files/. The files are named in the format $key.~N~. These files can be deleted like normal files. You can also prevent the backup from happening on clobbers by adding -H x-archive-keep-old-version:0 to your command.

Refer to archive.org Identifiers for more information on creating valid archive.org identifiers. Please also read the Internet Archive Items page before getting started.

Bulk Uploading¶

Uploading in bulk can be done similarly to Modifying Metadata in Bulk. The only difference is that you must provide a file column which contains a relative or absolute path to your file. Please see uploading.csv for an example.

Once you are ready to start your upload, simply run:

$ ia upload --spreadsheet=uploading.csv

See ia help upload for more details.

Download¶

Download an entire item:

$ ia download TripDown1905

Download specific files from an item:

$ ia download TripDown1905 TripDown1905_512kb.mp4 TripDown1905.ogv

Download specific files matching a glob pattern:

$ ia download TripDown1905 --glob="*.mp4"

Note that you may have to escape the * differently depending on your shell (e.g. \*.mp4, '*.mp4', etc.).

Download only files of a specific format:

$ ia download TripDown1905 --format='512Kb MPEG4'

Note that --format cannot be used with --glob. You can get a list of the formats of a given item like so:

$ ia metadata --formats TripDown1905

Download an entire collection:

$ ia download --search 'collection:glasgowschoolofart'

Download from an itemlist:

$ ia download --itemlist itemlist.txt

See ia help download for more details.

Downloading On-The-Fly Files¶

Some files on archive.org are generated on-the-fly as requested. This currently includes non-original files of the formats EPUB, MOBI, DAISY, and archive.org’s own MARC XML. These files can be downloaded using the --on-the-fly parameter:

$ ia download goodytwoshoes00newyiala --on-the-fly

Delete¶

You can use ia to delete files from archive.org items:

$ ia delete <identifier> <file>

Delete a file and all files derived from the specified file:

$ ia delete <identifier> <file> --cascade

Delete all files in an item:

$ ia delete <identifier> --all

Note that ia delete makes a backup of any files that are deleted. They are saved to a directory in the item named history/files/. The files are named in the format $key.~N~. These files can be deleted like normal files. You can also prevent the backup from happening on deletes by adding -H x-archive-keep-old-version:0 to your command.

See ia help delete for more details.

Search¶

ia can also be used for retrieving archive.org search results in JSON:

$ ia search 'subject:"market street" collection:prelinger'

By default, ia search attempts to return all items meeting the search criteria, and the results are sorted by item identifier. If you want to just select the top n items, you can specify a page and rows parameter. For example, to get the top 20 items matching the search ‘dogs’:

$ ia search --parameters="page=1&rows=20" "dogs"

You can use ia search to create an itemlist:

$ ia search 'collection:glasgowschoolofart' --itemlist > itemlist.txt

You can pipe your itemlist into a GNU Parallel command to download items concurrently:

$ ia search 'collection:glasgowschoolofart' --itemlist | parallel 'ia download {}'

See ia help search for more details.

Tasks¶

You can also use ia to retrieve information about your catalog tasks, after configuring ia. To retrieve the task history for an item, simply run:

$ ia tasks <identifier>

View all of your queued and running archive.org tasks:

$ ia tasks

See ia help tasks for more details.

List¶

You can list files in an item like so:

$ ia list goodytwoshoes00newyiala

See ia help list for more details.

Copy¶

You can copy files in archive.org items like so:

$ ia copy <src-identifier>/<src-filename> <dest-identifier>/<dest-filename>

If you’re copying your file to a new item, you can provide metadata as well:

$ ia copy <src-identifier>/<src-filename> <dest-identifier>/<dest-filename> --metadata 'title:My New Item' --metadata collection:test_collection

Note that ia copy makes a backup of any files that are clobbered. They are saved to a directory in the item named history/files/. The files are named in the format $key.~N~. These files can be deleted like normal files. You can also prevent the backup from happening on clobbers by adding -H x-archive-keep-old-version:0 to your command.

Move¶

ia move works just like ia copy except the source file is deleted after the file has been successfully copied.

Note that ia move makes a backup of any files that are clobbered or deleted. They are saved to a directory in the item named history/files/. The files are named in the format $key.~N~. These files can be deleted like normal files. You can also prevent the backup from happening on clobbers or deletes by adding -H x-archive-keep-old-version:0 to your command.

Internet Archive Items¶

What Is an Item?¶

Archive.org is made up of “items”. An item is a logical “thing” that we represent on one web page on archive.org. An item can be considered as a group of files that deserve their own metadata. If the files in an item have separate metadata, the files should probably be in different items. An item can be a book, a song, an album, a dataset, a movie, an image or set of images, etc. Every item has an identifier that is unique across archive.org.

How Items Are Structured¶

An item is just a directory of files and possibly subdirectories. Every item has at least two files named in the following format (see metadata page for more context on what an identifier is):

<identifier>_files.xml

<identifier>_meta.xml

The _meta.xml file is an XML file containing all of the metadata describing the item. The _files.xml file is an XML file containing all of the file-level metadata. There can only be one _meta.xml file and one _files.xml file per item.

Alongside these metadata files and the original files uploaded to the item, the item may also contain derivative files automatically generated by archive.org.

Item Limitations¶

As a rule of thumb, items should:

not be over 100GB

not contain more than 10,000 files.
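
Before uploading a large directory as a single item, you can check it against these rules of thumb locally. This is a rough sketch: item_size_report is a hypothetical helper, and reading "100GB" as 100 * 10**9 bytes is our assumption.

```python
import os
import tempfile

MAX_ITEM_BYTES = 100 * 10**9  # assumption: "100GB" read as decimal gigabytes
MAX_ITEM_FILES = 10_000


def item_size_report(directory):
    """Count files and bytes under a directory and compare with the limits."""
    total_bytes = 0
    file_count = 0
    for dirpath, _dirnames, filenames in os.walk(directory):
        for filename in filenames:
            total_bytes += os.path.getsize(os.path.join(dirpath, filename))
            file_count += 1
    return {
        "bytes": total_bytes,
        "files": file_count,
        "within_limits": (total_bytes <= MAX_ITEM_BYTES
                          and file_count <= MAX_ITEM_FILES),
    }


# Try it on a small throwaway directory with three 4-byte files.
with tempfile.TemporaryDirectory() as tmp:
    for index in range(3):
        with open(os.path.join(tmp, f"file{index}.txt"), "wb") as handle:
            handle.write(b"data")
    report = item_size_report(tmp)

print(report)
```

If within_limits comes back False, consider splitting the files across several items.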

Collections¶

All items must be part of a collection. A collection is simply an item with special characteristics. Besides an image file for the collection logo, files should never be uploaded directly to a collection item. Items can be assigned to a collection at the time of creation, or after the item has been created by modifying the collection element in an item’s metadata to contain the identifier for the given collection (e.g. ia metadata <identifier> -m collection:<collection-identifier>). Currently collections can only be created by archive.org staff. Please contact info@archive.org if you need a collection.

Archival URLs¶

An item’s “details” page will always be available at:

https://archive.org/details/<identifier>

The item directory is always available at:

https://archive.org/download/<identifier>

A particular file can always be downloaded from:

https://archive.org/download/<identifier>/<filename>

Note: Archival URLs may redirect to an actual server that contains the content. The resultant URL is not a permalink. For example, the archival URL:

https://archive.org/download/popeye_taxi-turvey/popeye_taxi-turvey_meta.xml

currently redirects to:

https://ia802304.us.archive.org/30/items/popeye_taxi-turvey/popeye_taxi-turvey_meta.xml

DO NOT LINK to any archive.org URL that begins with numbers like this. This refers to the particular machine that we’re serving the file from right now, but we move items to new servers all the time. If you link to this sort of URL, instead of the archival URL, your link WILL break at some point.
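
The URL patterns above are simple enough to build in code, and a small check can flag server-specific URLs before they end up in a link. The helper names are hypothetical, and the regular expression is an assumption based only on the shape of the example redirect shown above (a host such as ia802304.us.archive.org).

```python
import re
from urllib.parse import quote


def details_url(identifier):
    """Archival URL of an item's details page."""
    return f"https://archive.org/details/{identifier}"


def download_url(identifier, filename=None):
    """Archival URL of an item's directory, or of one file inside it."""
    url = f"https://archive.org/download/{identifier}"
    if filename is not None:
        url += "/" + quote(filename)
    return url


# Assumption: server-specific hosts look like "ia<digits>.us.archive.org".
SERVER_URL = re.compile(r"^https?://ia\d+\.us\.archive\.org/")


def is_server_specific(url):
    """Return True for URLs that point at a particular storage server."""
    return SERVER_URL.match(url) is not None


permalink = download_url("popeye_taxi-turvey", "popeye_taxi-turvey_meta.xml")
redirected = ("https://ia802304.us.archive.org/30/items/"
              "popeye_taxi-turvey/popeye_taxi-turvey_meta.xml")

print(permalink)
print(is_server_specific(permalink), is_server_specific(redirected))
```

Only the second URL is flagged; the first is the archival URL that is safe to link to.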

Internet Archive Metadata¶

Metadata is data about data. In the case of Internet Archive items, the metadata describes the contents of the items. Metadata can include information such as the performance date for a concert, the name of the artist, and a set list for the event.

Metadata is a very important element of items in the Internet Archive. Metadata allows people to locate and view information. Items with little or poor metadata may never be seen and can become lost.

Note that metadata keys must be valid XML tags. Please refer to the XML Naming Rules section here.
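
A quick local check can catch keys that are obviously not valid XML tag names. The sketch below is a conservative, ASCII-only approximation of the XML naming rules (start with a letter or underscore, then letters, digits, hyphens, underscores or periods, and no leading "xml"); it is not a full XML name validator, and is_plausible_metadata_key is a hypothetical helper.

```python
import re

# ASCII-only approximation of an XML name.
XML_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_.-]*")


def is_plausible_metadata_key(key):
    """Return True if key looks like a valid XML tag name (simplified check)."""
    if key.lower().startswith("xml"):
        return False
    return XML_NAME.fullmatch(key) is not None


for candidate in ["title", "my-field", "2fast", "has space", "xmlthing"]:
    print(candidate, is_plausible_metadata_key(candidate))
```

The first two candidates pass; the others fail because they start with a digit, contain a space, or begin with "xml".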

Archive.org Identifiers¶

Each item at Internet Archive has an identifier. An identifier is composed of any unique combination of alphanumeric characters, underscore (_) and dash (-). While there are no official limits, it is strongly suggested that identifiers be between 5 and 80 characters in length.

Identifiers must be unique across the entirety of Internet Archive, not simply unique within a single collection.

Once defined, an identifier cannot be changed. It will travel with the item or object and is involved in every manner of accessing or referring to the item.
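
Because an identifier is permanent, it is worth checking a candidate before you create the item. The following sketch tests only the character set and the suggested length described above; check_identifier is a hypothetical helper, and it cannot tell you whether the identifier is already taken on archive.org.

```python
import re

ALLOWED_CHARS = re.compile(r"[A-Za-z0-9_-]+")


def check_identifier(identifier):
    """Return a list of problems found with a candidate identifier."""
    problems = []
    if not ALLOWED_CHARS.fullmatch(identifier):
        problems.append("use only alphanumerics, underscore and dash")
    if not 5 <= len(identifier) <= 80:
        problems.append("length is outside the suggested 5-80 characters")
    return problems


for candidate in ["TripDown1905", "my item", "abc"]:
    print(candidate, check_identifier(candidate))
```

An empty list means the candidate passes both checks; the length rule is a recommendation, so you may choose to treat that message as a warning.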

Standard Internet Archive Metadata Fields¶

There are several standard metadata fields recognized for Internet Archive items. Most metadata fields are optional.

addeddate¶

Contains the date on which the item was added to Internet Archive.

Please use an ISO 8601 compatible format for this date. For instance, these are all valid date formats:

YYYY
YYYY-MM-DD
YYYY-MM-DD HH:MM:SS
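
You can check a date string against the three formats listed above with datetime.strptime. This sketch covers only those listed formats, not the whole of ISO 8601, and is_listed_date_format is a hypothetical helper name.

```python
from datetime import datetime

# strptime patterns for YYYY, YYYY-MM-DD and YYYY-MM-DD HH:MM:SS.
LISTED_FORMATS = ("%Y", "%Y-%m-%d", "%Y-%m-%d %H:%M:%S")


def is_listed_date_format(value):
    """Return True if value parses with one of the listed formats."""
    for fmt in LISTED_FORMATS:
        try:
            datetime.strptime(value, fmt)
        except ValueError:
            continue
        return True
    return False


for candidate in ["1906", "2013-09-27", "2013-09-27 12:30:00",
                  "27/09/2013", "2013-13-01"]:
    print(candidate, is_listed_date_format(candidate))
```

The first three candidates are accepted; the last two are rejected because of the separator and the out-of-range month.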

While it is possible to set the addeddate metadata value, it is not recommended. This value is typically set by automated processes.

adder¶

The name of the account which added the item to the Internet Archive.

While it is possible to set the adder metadata value, it is not recommended. This value is typically set by automated processes.

collection¶

A collection is a specialized item used for curation and aggregation of other items. Assigning an item to a collection defines where the item may be located by a user browsing Internet Archive.

A collection must exist prior to assigning any items to it. Currently collections can only be created by Internet Archive staff members. Please contact Internet Archive if you need a collection created.

All items should belong to a collection. If a collection is not specified at the time of upload, the item will be added to the opensource collection. For testing purposes, you may upload to the test_collection collection.

contributor¶

The value of the contributor metadata field is information about the entity responsible for making contributions to the content of the item. This is often the library, organization or individual making the item available on Internet Archive.

The value of this metadata field may contain HTML. <script> tags and CSS are not allowed.

coverage¶

The extent or scope of the content of the material available in the item. The value of the coverage metadata field may include geographic place, temporal period, jurisdiction, etc. For items which contain multi-volume or serial content, place the statement of holdings in this metadata field.

creator¶

An entity primarily responsible for creating the files contained in the item.

credits¶

The participants in the production of the materials contained in the item.

The value of this metadata field may contain HTML. <script> tags and CSS are not allowed.

date¶

The publication, production or other similar date of this item.

Please use an ISO 8601 compatible format for this date.

description¶

A description of the item.

The value of this metadata field may contain HTML. <script> tags and CSS are not allowed.

language¶

The primary language of the material available in the item.

While the language metadata field can hold any value, Internet Archive prefers MARC21 Language Codes.

licenseurl¶

A URL to the license which covers the works contained in the item.

Internet Archive recommends (but does not require) Creative Commons licensing. Creative Commons provides a license selector for finding the correct license for your needs.

mediatype¶

The primary type of media contained in the item. While an item can contain files of diverse mediatypes the value in this field defines the appearance and functionality of the item’s detail page on Internet Archive. In particular, the mediatype of an item defines what sort of online viewer is available for the files contained in the item.

The mediatype metadata field recognizes a limited set of values:

audio: The majority of audio items should receive this mediatype value. Items for the Live Music Archive should instead use the etree value.
collection: Denotes the item as a collection to which other collections and items can belong.
data: This is the default value for mediatype. Items with a mediatype of data will be available in Internet Archive but you will not be able to browse to them. In addition there will be no online reader/player for the files.
etree: Items which contain files for the Live Music Archive should have a mediatype value of etree. The Live Music Archive has very specific upload requirements. Please consult the documentation for the Live Music Archive prior to creating items for it.
image: Items which predominantly consist of image files should receive a mediatype value of image. Currently these items will not be available for browsing or online viewing in Internet Archive but they will require no additional changes when this mediatype receives additional support in the Archive.
movies: All videos (television, features, shorts, etc.) should receive a mediatype value of movies. These items will be displayed with an online video player.
software: Items with a mediatype of software are accessible to browse via Internet Archive’s software collection. There is no online viewer for software but all files are available for download.
texts: All text items (PDFs, EPUBs, etc.) should receive a mediatype value of texts.
web: The web mediatype value is reserved for items which contain web archive WARC files.

If the mediatype value you set is not in the list above it will be saved but ignored by the system. The item will be treated as though it has a mediatype value of data.

If a value is not specified for this field it will default to data.
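The fallback rule described above can be sketched locally. The helper name `effective_mediatype` is hypothetical and is not part of the internetarchive library; it only mirrors the rule that missing or unrecognized values are treated as data:

```python
# Hypothetical helper, not part of the internetarchive library.
# The set below is the list of recognized values from this section.
RECOGNIZED_MEDIATYPES = frozenset([
    'audio', 'collection', 'data', 'etree', 'image',
    'movies', 'software', 'texts', 'web',
])


def effective_mediatype(value=None):
    """Return the mediatype an item would be treated as having."""
    if value in RECOGNIZED_MEDIATYPES:
        return value
    # Missing or unrecognized values fall back to the default.
    return 'data'


print(effective_mediatype('texts'))
print(effective_mediatype('video'))
print(effective_mediatype())
```

A check like this can be run before uploading to catch a misspelled mediatype that would otherwise be saved but ignored.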

noindex¶

By default, all items have their metadata included in the Internet Archive search engine. To disable indexing in the search engine, include a noindex metadata tag. The value of the tag does not matter; its presence alone is enough to exclude the metadata from the search engine.

If an item’s metadata has already been indexed in the search engine, setting noindex will remove it from the index.

Items whose metadata is not included in the search engine index are not considered “public” per se and therefore will not have a value in the publicdate metadata field (see below).

notes¶

Contains user-defined information about the item.

The value of this metadata field may contain HTML. <script> tags and CSS are not allowed.

pick¶

On the v1 archive.org site, each collection page on Internet Archive may include a “Staff Picks” section. This section will highlight a single item in the collection. This item will be selected at random from the items with a pick metadata value of 1. If there are no items with this pick metadata value the “Staff Picks” section will not appear on the collection page.

By default all new items have no pick metadata value. Note: v2 of the archive.org website does not make use of this value.

publicdate¶

Items which have had their metadata included in the Internet Archive search engine index are considered to be public. The date the metadata is added to the index is the public date for the item.

Please use an ISO 8601 compatible format for this date. For instance, these are all valid date formats:

YYYY
YYYY-MM-DD
YYYY-MM-DD HH:MM:SS

While it is possible to set the publicdate metadata value it is not recommended. This value is typically set by automated processes.
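As a minimal sketch, the three layouts listed above can be checked on the client side with the standard library before a date is written to a metadata field. The helper name `parse_archive_date` is hypothetical, and the sketch covers only these three layouts, not every ISO 8601 variant:

```python
from datetime import datetime

# The three layouts listed above, expressed as strptime patterns.
DATE_FORMATS = ('%Y-%m-%d %H:%M:%S', '%Y-%m-%d', '%Y')


def parse_archive_date(value):
    """Parse a date string in one of the layouts listed above.

    Hypothetical helper; raises ValueError if no layout matches.
    """
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value, fmt)
        except ValueError:
            continue
    raise ValueError('unsupported date format: %r' % (value,))


print(parse_archive_date('2018-06-28'))
```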

publisher¶

The publisher of the material available in the item.

rights¶

A statement of the rights held in and over the files in the item.

The value of this metadata field may contain HTML. <script> tags and CSS are not allowed.

subject¶

Keyword(s) or phrase(s) that may be searched for to find your item. This field can contain multiple values:

$ ia metadata <identifier> --modify='subject:foo' --modify='subject:bar'

Or, in Python:

>>> from internetarchive import modify_metadata
>>> md = dict(subject=['foo', 'bar'])
>>> r = modify_metadata('<identifier>', md)

It is helpful but not necessary for you to use Library of Congress Subject Headings for the value of this metadata header.

title¶

The title for the item. This appears in the header of the item’s detail page on Internet Archive.

If a value is not specified for this field it will default to the identifier for the item.

updatedate¶

The date on which an update was made to the item. This field is repeatable.

Please use an ISO 8601 compatible format for this date.

While it is possible to set the updatedate metadata value it is not recommended. This value is typically set by automated processes.

updater¶

The name of the account which updated the item. This field is repeatable.

While it is possible to set the updater metadata value it is not recommended. This value is typically set by automated processes.

uploader¶

The name of the account which uploaded the file(s) to the item.

The uploader has ownership over the item and is allowed to maintain it.

This value is set by automated processes.

Custom Metadata Fields¶

Internet Archive strives to be metadata agnostic, enabling users to define the metadata format which best suits the needs of their material. In addition to the standard metadata fields listed above you may also define as many custom metadata fields as you require. These metadata fields can be defined ad hoc at item creation or metadata editing time and do not have to be defined in advance.

Developer Interface¶

Configuration¶

Certain functions of the internetarchive library require your archive.org credentials (i.e. uploading, modifying metadata, searching). Your credentials and other configurations can be provided via a dictionary when instantiating an ArchiveSession or Item object, or in a config file.

The easiest way to create a config file is with the configure function:

>>> from internetarchive import configure
>>> configure('user@example.com', 'password')

Config files are stored in either $HOME/.ia or $HOME/.config/ia.ini by default. You can also specify your own path:

>>> from internetarchive import configure
>>> configure('user@example.com', 'password', config_file='/home/jake/.config/ia-alternate.ini')

Custom config files can be specified when instantiating an ArchiveSession object:

>>> from internetarchive import get_session
>>> s = get_session(config_file='/home/jake/.config/ia-alternate.ini')

Or an Item object:

>>> from internetarchive import get_item
>>> item = get_item('nasa', config_file='/home/jake/.config/ia-alternate.ini')

IA-S3 Configuration¶

Your IA-S3 keys are required for uploading and modifying metadata. You can retrieve your IA-S3 keys at https://archive.org/account/s3.php.

They can be specified in your config file like so:

[s3]
access = mYaccEsSkEY
secret = mYs3cREtKEy

Or, using the ArchiveSession object:

>>> from internetarchive import get_session
>>> c = {'s3': {'access': 'mYaccEsSkEY', 'secret': 'mYs3cREtKEy'}}
>>> s = get_session(config=c)
>>> s.access_key
'mYaccEsSkEY'

Logging Configuration¶

You can specify logging levels and the location of your log file like so:

[logging]
level = INFO
file = /tmp/ia.log

Or, using the ArchiveSession object:

>>> from internetarchive import get_session
>>> c = {'logging': {'level': 'INFO', 'file': '/tmp/ia.log'}}
>>> s = get_session(config=c)

By default logging is turned off.

Other Configuration¶

By default all requests are HTTPS in Python versions 2.7.10 or newer. You can change this setting in your config file in the general section:

[general]
secure = False

Or, using the ArchiveSession object:

>>> from internetarchive import get_session
>>> s = get_session(config={'general': {'secure': False}})

In the example above, all requests will be made via HTTP.
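The config file sections shown above can also be generated programmatically. The sketch below assumes the file is a standard INI file that Python 3's configparser can write; the keys are the placeholder values from the examples above, and the in-memory buffer stands in for a real file path:

```python
import configparser
import io

# Placeholder values copied from the examples above; substitute your own.
config = configparser.ConfigParser()
config['s3'] = {'access': 'mYaccEsSkEY', 'secret': 'mYs3cREtKEy'}
config['logging'] = {'level': 'INFO', 'file': '/tmp/ia.log'}
config['general'] = {'secure': 'False'}

# Write to an in-memory buffer here; use open(path, 'w') for a real file.
buf = io.StringIO()
config.write(buf)
ini_text = buf.getvalue()
print(ini_text)

# Round-trip check: the text parses back into the same sections.
parsed = configparser.ConfigParser()
parsed.read_string(ini_text)
```

The resulting file could then be passed to get_session or get_item through the config_file parameter described earlier.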

ArchiveSession Objects¶

The ArchiveSession object is subclassed from requests.Session. It collects together your credentials and config.

get_session(config=None, config_file=None, debug=None, http_adapter_kwargs=None)¶

Return a new ArchiveSession object. The ArchiveSession object is the main interface to the internetarchive lib. It allows you to persist certain parameters across tasks.

Parameters:

config (dict) – (optional) A dictionary used to configure your session.
config_file (str) – (optional) A path to a config file used to configure your session.
http_adapter_kwargs (dict) – (optional) Keyword arguments that requests.adapters.HTTPAdapter takes.

Returns:

An ArchiveSession object.

Usage:

>>> from internetarchive import get_session
>>> config = dict(s3=dict(access='foo', secret='bar'))
>>> s = get_session(config)
>>> s.access_key
'foo'

From the session object, you can access all of the functionality of the internetarchive lib:

>>> item = s.get_item('nasa')
>>> item.download()
nasa: ddddddd - success
>>> s.get_tasks(task_ids=31643513)[0].server
'ia311234'

Item Objects¶

Item objects represent Internet Archive items. From the Item object you can create new items, upload files to existing items, read and write metadata, and download or delete files.

get_item(identifier, config=None, config_file=None, archive_session=None, debug=None, http_adapter_kwargs=None, request_kwargs=None)¶

Get an Item object.

Parameters:

identifier (str) – The globally unique Archive.org item identifier.
config (dict) – (optional) A dictionary used to configure your session.
config_file (str) – (optional) A path to a config file used to configure your session.
archive_session (ArchiveSession) – (optional) An ArchiveSession object can be provided via the archive_session parameter.
http_adapter_kwargs (dict) – (optional) Keyword arguments that requests.adapters.HTTPAdapter takes.
request_kwargs (dict) – (optional) Keyword arguments that requests.Request takes.

Usage:

>>> from internetarchive import get_item
>>> item = get_item('nasa')
>>> item.item_size
121084

Uploading¶

Uploading to an item can be done using Item.upload():

>>> item = get_item('my_item')
>>> r = item.upload('/home/user/foo.txt')

Or internetarchive.upload():

>>> from internetarchive import upload
>>> r = upload('my_item', '/home/user/foo.txt')

The item will automatically be created if it does not exist.

Refer to archive.org Identifiers for more information on creating valid archive.org identifiers.

Setting Remote Filenames¶

Remote filenames can be defined using a dictionary:

>>> from io import BytesIO
>>> fh = BytesIO()
>>> fh.write(b'foo bar')
>>> item.upload({'my-remote-filename.txt': fh})
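When several in-memory payloads need remote names, the dictionary can be built up front. This is only a sketch: `build_upload_mapping` is a hypothetical helper, and passing more than one entry in the dictionary is an assumption based on the single-entry example above:

```python
from io import BytesIO


def build_upload_mapping(payloads):
    """Map remote filenames to in-memory file objects.

    Hypothetical helper; `payloads` maps remote filename -> bytes.
    """
    mapping = {}
    for remote_name, data in payloads.items():
        # BytesIO(data) starts positioned at offset 0, so the whole
        # payload is available to whatever reads it next.
        mapping[remote_name] = BytesIO(data)
    return mapping


files = build_upload_mapping({
    'my-remote-filename.txt': b'foo bar',
    'another-remote-filename.txt': b'hello',
})
# The resulting dictionary could then be passed as: item.upload(files)
```

Constructing the BytesIO object from the bytes directly, rather than writing into an empty one, leaves the read position at the start of the data.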

upload(identifier, files, metadata=None, headers=None, access_key=None, secret_key=None, queue_derive=None, verbose=None, verify=None, checksum=None, delete=None, retries=None, retries_sleep=None, debug=None, request_kwargs=None, **get_item_kwargs)¶

Upload files to an item. The item will be created if it does not exist.

Parameters:

identifier (str) – The globally unique Archive.org identifier for a given item.
files – The filepaths or file-like objects to upload. This value can be an iterable or a single file-like object or string.
metadata (dict) – (optional) Metadata used to create a new item. If the item already exists, the metadata will not be updated – use modify_metadata.
headers (dict) – (optional) Add additional HTTP headers to the request.
access_key (str) – (optional) IA-S3 access_key to use when making the given request.
secret_key (str) – (optional) IA-S3 secret_key to use when making the given request.
queue_derive (bool) – (optional) Set to False to prevent an item from being derived after upload.
verbose (bool) – (optional) Display upload progress.
verify (bool) – (optional) Verify local MD5 checksum matches the MD5 checksum of the file received by IAS3.
checksum (bool) – (optional) Skip uploading files based on checksum.
delete (bool) – (optional) Delete local file after the upload has been successfully verified.
retries (int) – (optional) Number of times to retry the given request if S3 returns a 503 SlowDown error.
retries_sleep (int) – (optional) Amount of time to sleep between retries.
debug (bool) – (optional) Set to True to print headers to stdout, and exit without sending the upload request.
**get_item_kwargs – (optional) Arguments that get_item takes.

Returns:

A list of requests.Response objects.

Metadata¶

modify_metadata(identifier, metadata, target=None, append=None, append_list=None, priority=None, access_key=None, secret_key=None, debug=None, request_kwargs=None, **get_item_kwargs)¶

Modify the metadata of an existing item on Archive.org.

Parameters:

identifier (str) – The globally unique Archive.org identifier for a given item.
metadata (dict) – Metadata used to update the item.
target (str) – (optional) The metadata target to update. Defaults to metadata.
append (bool) – (optional) set to True to append metadata values to current values rather than replacing. Defaults to False.
append_list (bool) – (optional) Append values to an existing multi-value metadata field. No duplicate values will be added.
priority (int) – (optional) Set task priority.
access_key (str) – (optional) IA-S3 access_key to use when making the given request.
secret_key (str) – (optional) IA-S3 secret_key to use when making the given request.
debug (bool) – (optional) set to True to return a requests.Request object instead of sending request. Defaults to False.
**get_item_kwargs – (optional) Arguments that get_item takes.

Returns:

requests.Response object or requests.Request object if debug is True.

The default target to write to is metadata. If you would like to write to another target, such as files, you can do so using the target parameter. For example, if you had an item whose identifier was my_identifier and you wanted to add a metadata field to a file within the item called foo.txt:

>>> r = modify_metadata('my_identifier', metadata=dict(title='My File'), target='files/foo.txt')
>>> from internetarchive import get_files
>>> f = list(get_files('my_identifier', 'foo.txt'))[0]
>>> f.title
'My File'

You can also create new targets if they don’t exist:

>>> r = modify_metadata('my_identifier', metadata=dict(foo='bar'), target='extra_metadata')
>>> from internetarchive import get_item
>>> item = get_item('my_identifier')
>>> item.item_metadata['extra_metadata']
{'foo': 'bar'}

Downloading¶

download(identifier, files=None, formats=None, glob_pattern=None, dry_run=None, verbose=None, silent=None, ignore_existing=None, checksum=None, destdir=None, no_directory=None, retries=None, item_index=None, ignore_errors=None, on_the_fly=None, return_responses=None, **get_item_kwargs)¶

Download files from an item.

Parameters:

identifier (str) – The globally unique Archive.org identifier for a given item.
files – (optional) Only return files matching the given file names.
formats – (optional) Only return files matching the given formats.
glob_pattern (str) – (optional) Only return files matching the given glob pattern.
dry_run (bool) – (optional) Print URLs to files to stdout rather than downloading them.
verbose (bool) – (optional) Turn on verbose output.
silent (bool) – (optional) Suppress all output.
ignore_existing (bool) – (optional) Skip files that already exist locally.
checksum (bool) – (optional) Skip downloading file based on checksum.
destdir (str) – (optional) The directory to download files to.
no_directory (bool) – (optional) Download files to current working directory rather than creating an item directory.
retries (int) – (optional) The number of times to retry on failed requests.
item_index (int) – (optional) The index of the item for displaying progress in bulk downloads.
ignore_errors (bool) – (optional) Don’t fail if a single file fails to download, continue to download other files.
on_the_fly (bool) – (optional) Download on-the-fly files (i.e. derivative EPUB, MOBI, DAISY files).
return_responses (bool) – (optional) Rather than downloading files to disk, return a list of response objects.
**get_item_kwargs – (optional) Arguments that get_item takes.

Return type:

bool

Returns:

True if all files were downloaded successfully.

Deleting¶

delete(identifier, files=None, formats=None, glob_pattern=None, cascade_delete=None, access_key=None, secret_key=None, verbose=None, debug=None, **kwargs)¶

Delete files from an item. Note: Some system files, such as <itemname>_meta.xml, cannot be deleted.

Parameters:

identifier (str) – The globally unique Archive.org identifier for a given item.
files – (optional) Only delete files matching the given filenames.
formats – (optional) Only delete files matching the given formats.
glob_pattern (str) – (optional) Only delete files matching the given glob pattern.
cascade_delete (bool) – (optional) Also deletes files derived from the file, and files the file was derived from.
access_key (str) – (optional) IA-S3 access_key to use when making the given request.
secret_key (str) – (optional) IA-S3 secret_key to use when making the given request.
verbose (bool) – (optional) Print actions to stdout.
debug (bool) – (optional) Set to True to print headers to stdout and exit without sending the delete request.

File Objects¶

get_files(identifier, files=None, formats=None, glob_pattern=None, on_the_fly=None, **get_item_kwargs)¶

Get File objects from an item.

Parameters:

identifier (str) – The globally unique Archive.org identifier for a given item.
files (iterable) – (optional) Only return files matching the given filenames.
formats (iterable) – (optional) Only return files matching the given formats.
glob_pattern (str) – (optional) Only return files matching the given glob pattern.
on_the_fly (bool) – (optional) Include on-the-fly files (i.e. derivative EPUB, MOBI, DAISY files).
**get_item_kwargs – (optional) Arguments that get_item() takes.

Usage:

>>> from internetarchive import get_files
>>> fnames = [f.name for f in get_files('nasa', glob_pattern='*xml')]
>>> print(fnames)
['nasa_reviews.xml', 'nasa_meta.xml', 'nasa_files.xml']
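To preview which names a glob_pattern would select without contacting archive.org, the pattern can be tried against a list of names locally. This sketch assumes glob_pattern follows the shell-style wildcard rules implemented by Python's fnmatch module; the library's own matching may differ in detail, and example.jpg is a made-up name added for contrast:

```python
from fnmatch import fnmatch

# File names taken from the usage example above, plus one hypothetical
# non-matching name added for illustration.
names = ['nasa_reviews.xml', 'nasa_meta.xml', 'nasa_files.xml', 'example.jpg']

# Keep only the names that match the same pattern used above.
matches = [n for n in names if fnmatch(n, '*xml')]
print(matches)
```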

Searching Items¶

search_items(query, fields=None, sorts=None, params=None, archive_session=None, config=None, config_file=None, http_adapter_kwargs=None, request_kwargs=None, max_retries=None)¶

Search for items on Archive.org.

Parameters:

query (str) – The Archive.org search query to yield results for. Refer to https://archive.org/advancedsearch.php#raw for help formatting your query.
fields (list) – (optional) The metadata fields to return in the search results.
params (dict) – (optional) The URL parameters to send with each request sent to the Archive.org Advancedsearch Api.
config (dict) – (optional) Configuration options for session.
config_file (str) – (optional) A path to a config file used to configure your session.
http_adapter_kwargs (dict) – (optional) Keyword arguments that requests.adapters.HTTPAdapter takes.
request_kwargs (dict) – (optional) Keyword arguments that requests.Request takes.
max_retries (int, object) –
The number of times to retry a failed request. This can also be a urllib3.Retry object. If you need more control (e.g. status_forcelist), use an ArchiveSession object, and mount your own adapter after the session object has been initialized. For example:
```
>>> from internetarchive import get_session
>>> s = get_session()
>>> s.mount_http_adapter()
>>> search_results = s.search_items('nasa')
```
See ArchiveSession.mount_http_adapter() for more details.

Returns:

A Search object, yielding search results.

Internet Archive Tasks¶

get_tasks(identifier=None, task_ids=None, task_type=None, params=None, config=None, config_file=None, verbose=None, archive_session=None, http_adapter_kwargs=None, request_kwargs=None)¶

Get tasks from the Archive.org catalog. internetarchive must be configured with your logged-in-* cookies to use this function. If no arguments are provided, all queued tasks for the user will be returned.

Parameters:

identifier (str) – (optional) The Archive.org identifier for which to retrieve tasks.
task_ids (int or str) – (optional) The task_ids to retrieve from the Archive.org catalog.
task_type (str) – (optional) The type of tasks to retrieve from the Archive.org catalog. The types can be either “red” for failed tasks, “blue” for running tasks, “green” for pending tasks, “brown” for paused tasks, or “purple” for completed tasks.
params (dict) – (optional) The URL parameters to send with each request sent to the Archive.org catalog API.
config (dict) – (optional) Configuration options for session.
verbose (bool) – (optional) Set to True to retrieve verbose information for each catalog task returned. verbose is set to True by default.

Returns:

A set of CatalogTask objects.
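The color names accepted by task_type can be hard to remember. A small lookup table, built only from the description of task_type above, makes the meaning explicit. `describe_task_type` is a hypothetical helper and is not part of the internetarchive library:

```python
# Mapping taken from the task_type description above.
TASK_TYPE_STATES = {
    'red': 'failed',
    'blue': 'running',
    'green': 'pending',
    'brown': 'paused',
    'purple': 'completed',
}


def describe_task_type(color):
    """Hypothetical helper: translate a task_type color into its state."""
    try:
        return TASK_TYPE_STATES[color]
    except KeyError:
        raise ValueError('unknown task_type: %r' % (color,))


print(describe_task_type('red'))
```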

Updates¶

Release History¶

1.8.0 (2018-06-28)¶

Features and Improvements

Only use backports.csv for python2 in support of FreeBSD port.
Added a nicer error message to ia search for authentication errors.
Added support for using netrc files in ia configure.
Added --remove option to ia metadata for removing values from single or multi-field metadata elements.
Added support for appending a metadata value to an existing metadata element (as a new entry, not simply appending to a string).
Added --no-change-timestamp flag to ia download. Downloaded files retain the timestamp of “now”, not of the source material, when this option is used.

Bugfixes

Fixed bug in upload where StringIO objects were not uploadable.
Fixed encoding issues that were causing some ia tasks commands to fail.
Fixed bug where keep-old-version wasn’t working in ia move.
Fixed bug in internetarchive.api.modify_metadata where debug and other args were not honoured.

1.7.7 (2018-03-05)¶

Features and Improvements

Added support for downloading on-the-fly archive_marc.xml files.

Bugfixes

Improved syntax checking in ia move and ia copy.
Added Connection:close header to all requests to force close connections after each request. This is a workaround for dealing with a bug on archive.org servers where the server hangs up before sending the complete response.

1.7.6 (2018-01-05)¶

Features and Improvements

Added ability to set the remote-name for a directory in ia upload (previously you could only do this for single files).

Bugfixes

Fixed bug in ia delete where all requests were failing due to a typo in a function arg.

1.7.5 (2017-12-07)¶

Features and Improvements

Turned on x-archive-keep-old-version S3 header by default for all ia upload, ia delete, ia copy, and ia move commands. This means that any ia command that clobbers or deletes a file will save a version of the file in <identifier>/history/files/$key.~N~. This is only on by default in the CLI, and not in the Python lib. It can be turned off by adding -H x-archive-keep-old-version:0 to any ia upload, ia delete, ia copy, or ia move command.

1.7.4 (2017-11-06)¶

Features and Improvements

Increased timeout in search from 12 seconds to 24.
Added ability to set the max_retries in internetarchive.search_items().
Made internetarchive.ArchiveSession.mount_http_adapter() a public method for supporting complex custom retry logic.
Added --timeout option to ia search for setting a custom timeout.
Loosened requirements for schema library to schema>=0.4.0.

Bugfixes

The scraping API has reverted to using items key rather than docs key. v1.7.3 will still work, but this change keeps ia consistent with the API.

1.7.3 (2017-09-20)¶

Bugfixes

Fixed bug in search where search requests were failing with KeyError: 'items'.

1.7.2 (2017-09-11)¶

Features and Improvements

Added support for adding custom headers to ia search.

Bugfixes

internetarchive.utils.get_s3_xml_text() is used to parse errors returned by S3 in XML. Sometimes there is no XML in the response. Most of the time this is due to 5xx errors. Either way, we want to always return the HTTPError, even if the XML parsing fails.
Fixed a regression where : was being stripped from filenames in upload.
Do not create a directory in download() when return_responses is True.
Fixed bug in upload where file-like objects were failing with a TypeError exception.

1.7.1 (2017-07-25)¶

Bugfixes

Fixed bug in Item.upload_file() where checksum was being set to True if it was set to None.

1.7.1 (2017-07-25)¶

Bugfixes

Fixed bug in ia upload where all commands would fail if multiple collections were specified (e.g. -m collection:foo -m collection:bar).

1.7.0 (2017-07-25)¶

Features and Improvements

Loosened up jsonpatch requirements, as the metadata API now supports more recent versions of the JSON Patch standard.
Added support for building “snap” packages (https://snapcraft.io/).

Bugfixes

Fixed bug in upload where users were unable to add their own timeout via request_kwargs.
Fixed bug where files with non-ascii filenames failed to upload on some platforms.
Fixed bug in upload where metadata keys with an index (e.g. subject[0]) would make the request fail if the key was the only indexed key provided.
Added a default timeout to ArchiveSession.s3_is_overloaded(). If it times out now, it returns True (as in, yes, S3 is overloaded).

1.6.0 (2017-06-27)¶

Features and Improvements

Added 60 second timeout to all upload requests.
Added support for uploading empty files.
Refactored Item.get_files() to be faster, especially for items with many files.
Updated search to use IA-S3 keys for auth instead of cookies.

Bugfixes

Fixed bug in upload where derives weren’t being queued in some cases where checksum=True was set.
Fixed bug where ia tasks and other Catalog functions were always using HTTP even when it should have been HTTPS.
ia metadata was exiting with a non-zero status for “no changes to xml” errors. This now exits with 0, as nearly every time this happens it should not be considered an “error”.
Added unicode support to ia upload --spreadsheet and ia metadata --spreadsheet using the backports.csv module.
Fixed bug in ia upload --spreadsheet where some metadata was accidentally being copied from previous rows (e.g. when multiple subjects were used).
Submitter wasn’t being added to ia tasks --json output, it now is.
row_type in ia tasks --json was returning integer for row-type rather than name (e.g. ‘red’).

1.5.0 (2017-02-17)¶

Features and Improvements

Added option to download() for returning a list of response objects rather than writing files to disk.

1.4.0 (2017-01-26)¶

Bugfixes

Another bugfix for setting mtime correctly after fileobj functionality was added to ia download.

1.3.0 (2017-01-26)¶

Bugfixes

Fixed bug where download was trying to set mtime, even when fileobj was set to True (e.g. ia download <id> <file> --stdout).

1.2.0 (2017-01-26)¶

Features and Improvements

Added ia copy and ia move for copying and moving files in archive.org items.
Added support for outputting JSON in ia tasks.
Added support to ia download to write to stdout instead of file.

Bugfixes

Fixed bug in upload where AttributeError was raised when trying to upload file-like objects without a name attribute.
Removed identifier validation from ia delete. If an identifier already exists, we don’t need to validate it. This only makes things annoying if an identifier exists but fails internetarchive id validation.
Fixed bug where error message isn’t returned in ia upload if the response body is not XML. Ideally IA-S3 would always return XML, but that’s not the case as of now. Try to dump the HTML in the S3 response if unable to parse XML.
Fixed bug where ArchiveSession headers weren’t being sent in prepared requests.
Fixed bug in ia upload --size-hint where value was an integer, but requests requires it to be a string.
Added support for downloading files to stdout in ia download and File.download.

1.1.0 (2016-11-18)¶

Features and Improvements

Make sure collection exists when creating new item via ia upload. If it doesn’t, upload will fail.
Refactored tests.

Bugfixes

Fixed bug where the full filepath was being set as the remote filename in Windows.
Convert all metadata header values to strings for compatibility with requests>=2.11.0.

1.0.10 (2016-09-20)¶

Bugfixes

Convert x-archive-cascade-delete headers to strings for compatibility with requests>=2.11.0.

1.0.9 (2016-08-16)¶

Features and Improvements

Added support to the CLI for providing username and password as options on the command-line.

1.0.8 (2016-08-10)¶

Features and Improvements

Increased maximum identifier length from 80 to 100 characters in ia upload.

Bugfixes

As of version 2.11.0 of the requests library, all header values must be strings (i.e. not integers). internetarchive now converts all header values to strings.

1.0.7 (2016-08-02)¶

Features and Improvements

Added internetarchive.api.get_user_info().

1.0.6 (2016-07-14)¶

Bugfixes

Fixed bug where upload was failing on file-like objects (e.g. StringIO objects).

1.0.5 (2016-07-07)¶

Features and Improvements

All metadata writes are now submitted at -5 priority by default. This is friendlier to the archive.org catalog, and should only be changed for one-off metadata writes.
Expanded scope of valid identifiers in utils.validate_ia_identifier (i.e. ia upload). Periods are now allowed. Periods, underscores, and dashes are not allowed as the first character.

1.0.4 (2016-06-28)¶

Features and Improvements

Search now uses the v1 scraping API endpoint.
Moved internetarchive.item.Item.upload.iter_directory() to internetarchive.utils.
Added support for downloading “on-the-fly” files (e.g. EPUB, MOBI, and DAISY) via ia download <id> --on-the-fly or item.download(on_the_fly=True).

Bugfixes

s3_is_overloaded() now returns True if the call is unsuccessful.
Fixed bug in upload where a derive task wasn’t being queued when a directory is uploaded.

1.0.3 (2016-05-16)¶

Features and Improvements

Use scrape API for getting total number of results rather than the advanced search API.
Improved error messages for IA-S3 (upload) related errors.
Added retry support to delete.
ia delete no longer exits if a single request fails when deleting multiple files, but continues onto the next file. If any file fails, the command will exit with a non-zero status code.
All search requests now require authentication via IA-S3 keys. You can run ia configure to generate a config file that will be used to authenticate all search requests automatically. For more details refer to the following links:

http://internetarchive.readthedocs.io/en/latest/quickstart.html?highlight=configure#configuring

http://internetarchive.readthedocs.io/en/latest/api.html#configuration
Added ability to specify your own filepath in ia configure and internetarchive.configure().

Bugfixes

Updated requests lib version requirements. This resolves issues with sending binary strings as bodies in Python 3.
Improved support for Windows, see https://github.com/jjjake/internetarchive/issues/126 for more details.
Previously all requests were made in HTTP for Python versions < 2.7.9 due to the issues described at https://urllib3.readthedocs.org/en/latest/security.html. In favor of security over convenience, all requests are now made via HTTPS regardless of Python version. Refer to http://internetarchive.readthedocs.org/en/latest/troubleshooting.html#https-issues if you are experiencing issues.
Fixed bug in ia CLI where --insecure was still making HTTPS requests when it should have been making HTTP requests.
Fixed bug in ia delete where --all option wasn’t working because it was using item.iter_files instead of item.get_files.
Fixed bug in ia upload where uploading files with unicode file names were failing.
Fixed bug in upload where filenames with ; characters were being truncated.
Fixed bug in internetarchive.catalog where TypeError was being raised in Python 3 due to mixing bytes with strings.

1.0.2 (2016-03-07)¶

Bugfixes

Fixed OverflowError bug in uploads on 32-bit systems when uploading files larger than ~2GB.
Fixed unicode bug in upload where urllib.parse.quote is unable to parse non-encoded strings.

Features and Improvements

Only generate MD5s in upload if they are used (i.e. verify, delete, or checksum is True).
verify is off by default in ia upload, it can be turned on with ia upload --verify.

1.0.1 (2016-03-04)¶

Bugfixes

Fixed memory leak in ia upload --spreadsheet=metadata.csv.
Fixed arg parsing bug in ia CLI.

1.0.0 (2016-03-01)¶

Features and Improvements

Renamed internetarchive.iacli to internetarchive.cli.
Moved File object to internetarchive.files.
Converted config format from YAML to INI to avoid PyYAML requirement.
Use HTTPS by default for Python versions > 2.7.9.
Added get_username function to API.
Improved Python 3 support. internetarchive is now being tested against Python versions 2.6, 2.7, 3.4, and 3.5.
Improved plugin support.
Added retry support to download and metadata retrieval.
Added Collection object.
Made Item objects hashable and orderable.

Bugfixes

IA’s Advanced Search API no longer supports deep-paging of large result sets. All search functions have been refactored to use the new Scrape API (http://archive.org/help/aboutsearch.htm). Search functions in previous versions are effectively broken; upgrade to >=1.0.0.

0.9.8 (2015-11-09)¶

Bugfixes

Fixed ia help bug.
Fixed bug in File.download() where connection errors weren’t being caught/retried correctly.

0.9.7 (2015-11-05)¶

Bugfixes

Cleanup partially downloaded files when download() fails.

Features and Improvements

Added --format option to ia delete.
Refactored download() and ia download to behave more like rsync. Files are now clobbered by default, ignore_existing and --ignore-existing now skip over files already downloaded without making a request.
Added retry support to download() and ia download.
Added files kwarg to Item.download() for downloading specific files.
Added ignore_errors option to File.download() for ignoring (but logging) exceptions.
Added default timeouts to metadata and download requests.
Less verbose output in ia download by default; use ia download --verbose for old-style output.

0.9.6 (2015-10-12)¶

Bugfixes

Removed sync-db features for now, as lazytable is not playing nicely with setup.py.

0.9.5 (2015-10-12)¶

Features and Improvements

Added skip based on mtime and length if no other clobber/skip options specified in download() and ia download.
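
The idea of skipping on mtime and length can be sketched as follows. This is a minimal illustration, not the library's implementation: the helper should_skip and its parameters are hypothetical, and it assumes the remote size (in bytes) and modification time (seconds since the epoch) are already known from the item's file metadata.

```python
import os
import tempfile


def should_skip(local_path, remote_size, remote_mtime):
    """Return True if the local file looks identical to the remote one.

    Hypothetical helper: compares only size and whole-second mtime,
    without reading the file contents.
    """
    if not os.path.exists(local_path):
        return False
    st = os.stat(local_path)
    return st.st_size == remote_size and int(st.st_mtime) == int(remote_mtime)


# Demonstrate with a temporary file standing in for a previous download.
with tempfile.NamedTemporaryFile(delete=False) as fh:
    fh.write(b"hello")
    demo_path = fh.name
os.utime(demo_path, (1444608000, 1444608000))  # set atime and mtime

skip_same = should_skip(demo_path, 5, 1444608000)     # same size and mtime
skip_changed = should_skip(demo_path, 6, 1444608000)  # size differs
os.remove(demo_path)
```

A check like this is cheap because it never reads the file, at the cost of missing changes that preserve both size and timestamp.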

0.9.4 (2015-10-01)¶

Features and Improvements

Added internetarchive.api.get_username() for retrieving a username with an S3 key-pair.
Added ability to sync downloads via an sqlite database.

0.9.3 (2015-09-28)¶

Features and Improvements

Added ability to download items from an itemlist or search query in ia download.
Made ia configure Python 3 compatible.

Bugfixes

Fixed bug in ia upload where uploading an item with more than one collection specified caused the collection check to fail.

0.9.2 (2015-08-17)¶

Bugfixes

Added error message for failed ia configure calls due to invalid creds.

0.9.1 (2015-08-13)¶

Bugfixes

Updated docopt to v0.6.2 and PyYAML to v3.11.
Updated setup.py to automatically pull version from __init__.

0.8.5 (2015-07-13)¶

Bugfixes

Fixed UnicodeEncodeError in ia metadata --append.

Features and Improvements

Added configuration documentation to readme.
Updated requests to v2.7.0.

0.8.4 (2015-06-18)¶

Features and Improvements

Added check to ia upload to see if the collection being uploaded to exists. Also added an option to override this check.

0.8.3 (2015-05-18)¶

Features and Improvements

Fixed append to work like a standard metadata update if the metadata field does not yet exist for the given item.

0.8.0 (2015-03-09)¶

Bugfixes

Encode filenames in upload URLs.
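
To show why filenames need encoding, here is a small sketch using urllib.parse.quote from the Python 3 standard library. The base URL, the identifier, and the helper build_upload_url are placeholders chosen for illustration; they are not the endpoint or function the library uses.

```python
from urllib.parse import quote

# Hypothetical endpoint, used only for illustration.
BASE_URL = "https://example.org/upload"


def build_upload_url(identifier, filename):
    """Percent-encode the filename so spaces and '#' survive in a URL."""
    return "{}/{}/{}".format(BASE_URL, identifier, quote(filename))


url = build_upload_url("my-item", "my file #1.txt")
# url == "https://example.org/upload/my-item/my%20file%20%231.txt"
```

Without the encoding, everything after the "#" would be treated as a URL fragment and never reach the server.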

0.7.9 (2015-01-26)¶

Bugfixes

Fixed bug in internetarchive.config.get_auth_config (i.e. ia configure) where logged-in cookies returned expired within hours. Cookies should now be valid for about one year.

0.7.8 (2014-12-23)¶

Output error message when downloading non-existing files in ia download rather than raising Python exception.
Fixed IOError in ia search when using head, tail, etc.
Simplified ia search to output only JSON, rather than doing any special formatting.
Added experimental support for creating pex binaries of ia in Makefile.

0.7.7 (2014-12-17)¶

Simplified ia configure. It now only asks for Archive.org email/password and automatically adds S3 keys and Archive.org cookies to config. See internetarchive.config.get_auth_config().

0.7.6 (2014-12-17)¶

Write metadata to stdout rather than stderr in ia mine.
Added options to search archive.org/v2.
Added destdir option to download files/itemdirs to a given destination dir.

0.7.5 (2014-10-08)¶

Fixed typo.

0.7.4 (2014-10-08)¶

Fixed missing “import” typo in internetarchive.iacli.ia_upload.

0.7.3 (2014-10-08)¶

Added progress bar to ia mine.
Fixed unicode metadata support for upload().

0.7.2 (2014-09-16)¶

Suppress KeyboardInterrupt exceptions and exit with status code 130.
Added ability to skip downloading files based on checksum in ia download, Item.download(), and File.download().
ia download is now verbose by default. Output can be suppressed with the --quiet flag.
Added an option to not download into item directories, but rather the current working directory (i.e. ia download --no-directories <id>).
Added/fixed support for modifying different metadata targets (i.e. files/logo.jpg).
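
Skipping a download based on checksum can be sketched roughly as below. The helper names md5_of_file and already_downloaded are hypothetical, and the sketch assumes the expected MD5 hex digest is available from the file's metadata; the library's own logic may differ.

```python
import hashlib
import os
import tempfile


def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the hex MD5 of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def already_downloaded(path, expected_md5):
    """True when a local file exists and its MD5 matches the expected one."""
    return os.path.exists(path) and md5_of_file(path) == expected_md5


# Demonstrate with a temporary file standing in for a previous download.
with tempfile.NamedTemporaryFile(delete=False) as fh:
    fh.write(b"hello")
    demo_path = fh.name

matches = already_downloaded(demo_path, hashlib.md5(b"hello").hexdigest())
differs = already_downloaded(demo_path, "0" * 32)
os.remove(demo_path)
```

Reading in chunks keeps memory use flat for large files, which matters when the files being compared are several gigabytes.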

0.7.1 (2014-08-25)¶

Added Item.s3_is_overloaded() method for S3 status check. This method is now also used on retries in the upload method, which avoids uploading any data if a 503 is expected. If a 503 is still returned, retries are attempted.
Added --status-check option to ia upload for S3 status check.
Added --source parameter to ia list for returning files matching IA source (i.e. original, derivative, metadata, etc.).
Added support to ia upload for setting remote-name if only a single file is being uploaded.
Derive tasks are now only queued after the last file has been uploaded.
File URLs are now quoted in File objects, for downloading files with special characters in their filenames.

0.7.0 (2014-07-23)¶

Added support for retry on S3 503 SlowDown errors.
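
A retry loop for 503 responses might look something like the sketch below. It is an assumption-laden illustration: call_with_retries, its parameters, and the linear backoff are all hypothetical, and the "server" is simulated with a list of status codes rather than real HTTP requests.

```python
import time


def call_with_retries(send, retries=3, backoff=1.0, sleep=time.sleep):
    """Call send() until it returns a status other than 503.

    send is any zero-argument callable returning an HTTP status code.
    At most `retries` extra attempts are made, sleeping a little longer
    before each one.
    """
    status = send()
    for attempt in range(1, retries + 1):
        if status != 503:
            break
        sleep(backoff * attempt)
        status = send()
    return status


# Simulated server: two SlowDown responses, then success.
responses = iter([503, 503, 200])
waits = []  # record the requested sleeps instead of actually sleeping
final_status = call_with_retries(lambda: next(responses), sleep=waits.append)
```

Passing the sleep function in as a parameter keeps the example fast to run and easy to test.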

0.6.9 (2014-07-15)¶

Added support for \n and \r characters in upload headers.
Added support for reading filenames from stdin when using the ia delete command.

0.6.8 (2014-07-11)¶

The delete ia subcommand is now verbose by default.
Added glob support to the delete ia subcommand (i.e. ia delete --glob='*jpg').
Changed indexed metadata elements to clobber values instead of insert.
AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY are now deprecated. IAS3_ACCESS_KEY and IAS3_SECRET_KEY must be used if setting IAS3 keys via environment variables.
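
For scripts that read these variables themselves, a lookup could be sketched as follows. The helper ias3_keys_from_env is hypothetical and not part of the library; only the two IAS3_* variable names come from the note above, and the key values are placeholders.

```python
import os


def ias3_keys_from_env(environ=None):
    """Return (access_key, secret_key) from the environment, or None.

    Only the IAS3_* names are consulted; the deprecated AWS_* names are
    deliberately ignored in this sketch.
    """
    environ = os.environ if environ is None else environ
    access = environ.get("IAS3_ACCESS_KEY")
    secret = environ.get("IAS3_SECRET_KEY")
    if access and secret:
        return access, secret
    return None


# Placeholder values, not real credentials.
keys = ias3_keys_from_env(
    {"IAS3_ACCESS_KEY": "access", "IAS3_SECRET_KEY": "secret"}
)
legacy = ias3_keys_from_env(
    {"AWS_ACCESS_KEY_ID": "access", "AWS_SECRET_ACCESS_KEY": "secret"}
)
```

Here keys is a tuple of both values, while legacy is None because only the deprecated names were set.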

Troubleshooting¶

HTTPS Issues¶

The internetarchive library uses the HTTPS protocol for making secure requests by default. This can cause issues when using versions of Python earlier than 2.7.9:

Certain Python platforms (specifically, versions of Python earlier than 2.7.9) have restrictions in their ssl module that limit the configuration that urllib3 can apply. In particular, this can cause HTTPS requests that would succeed on more featureful platforms to fail, and can cause certain security features to be unavailable.

See https://urllib3.readthedocs.org/en/latest/security.html for more details.

If you are using a Python version earlier than 2.7.9, you might see InsecurePlatformWarning and SNIMissingWarning warnings and your requests might fail. There are a few options to address this issue:

Upgrade your Python to version 2.7.9 or more recent.
Install or upgrade the following Python modules as documented here: PyOpenSSL, ndg-httpsclient, and pyasn1.
Use HTTP to make insecure requests in one of the following ways:

Adding the following lines to your ia.ini config file (usually located at ~/.config/ia.ini or ~/.ia.ini):

[general]
secure = false

In the Python interface, using a config dict:

>>> from internetarchive import get_item
>>> config = dict(general=dict(secure=False))
>>> item = get_item('<identifier>', config=config)

In the command-line interface, use the --insecure option:

$ ia --insecure download <identifier>
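
To check how an INI fragment like the one above is interpreted, it can be parsed with the standard library's configparser. This is only a sketch of reading the file format; it does not claim to reproduce how internetarchive loads its config, and the fallback value is an assumption mirroring the HTTPS-by-default behaviour described in this section.

```python
import configparser

# The same fragment as in the ia.ini example, inlined as a string.
INI_TEXT = """
[general]
secure = false
"""

parser = configparser.ConfigParser()
parser.read_string(INI_TEXT)

# getboolean() understands true/false, yes/no, on/off and 1/0.
# The fallback of True is an assumption: HTTPS unless told otherwise.
secure = parser.getboolean("general", "secure", fallback=True)
scheme = "https" if secure else "http"
```

With the fragment shown, secure is False and scheme is "http".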

OverflowError¶

On some 32-bit systems you may run into issues uploading files larger than 2 GB. You may see an error that looks something like OverflowError: long int too large to convert to int. You can get around this by upgrading requests:

$ pip install --upgrade requests

You can find more details about this issue at the following links:

https://github.com/sigmavirus24/requests-toolbelt/issues/80
https://github.com/kennethreitz/requests/issues/2691
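
If you are unsure whether your interpreter is a 32-bit build, a quick check along these lines may help. The helpers is_32bit_python and may_overflow are hypothetical, and the 2 GB threshold is a rough heuristic based on the description above, not a guarantee that the error will or will not occur.

```python
import struct

TWO_GB = 2 ** 31  # beyond this, a signed 32-bit integer overflows


def is_32bit_python():
    """True when the running interpreter uses 32-bit pointers."""
    return struct.calcsize("P") * 8 == 32


def may_overflow(file_size):
    """Rough check: could a file of this size hit the problem here?"""
    return is_32bit_python() and file_size >= TWO_GB


pointer_bits = struct.calcsize("P") * 8  # 32 or 64 on common platforms
small_file_risky = may_overflow(1024)
```

On a 64-bit interpreter may_overflow always returns False; on a 32-bit one it flags files of 2 GB or more, in which case upgrading requests as described above is the suggested fix.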

How to Contribute¶

Thank you for considering contributing. All contributions are welcome and appreciated!

Support Questions¶

Please don’t use the Github issue tracker for asking support questions. All support questions should be emailed to info@archive.org.

Bug Reports¶

The Github issue tracker is used for tracking bugs. Please consider the following when opening an issue:

Avoid opening duplicate issues by taking a look at the current open issues.
Provide details on the version, operating system and Python version you are running.
Include complete tracebacks and error messages.

Pull Requests¶

All pull requests and patches are welcome, but please consider the following:

Include tests.
Include documentation for new features.
If your patch is supposed to fix a bug, please describe in as much detail as possible the circumstances in which the bug happens.
Please follow PEP8, with the exception of what is ignored in setup.cfg. PEP8 compliance is checked when tests run. Tests will fail if your patch is not PEP8 compliant.
Add yourself to AUTHORS.rst.
Avoid introducing new dependencies.
Open an issue if a relevant one is not already open, so others have visibility into what you’re working on and efforts aren’t duplicated.
Clarity is preferred over brevity.

Running Tests¶

The minimal requirements for running tests are pytest, pytest-pep8 and responses:

$ pip install pytest pytest-pep8 responses

Clone the internetarchive lib:

$ git clone https://github.com/jjjake/internetarchive

Install the internetarchive lib as an editable package:

$ cd internetarchive
$ pip install -e .

Run the tests:

$ py.test --pep8

Note that this will only test against the Python version you are currently using. However, internetarchive is tested against multiple Python versions defined in tox.ini. Tests must pass on all versions defined in tox.ini for all pull requests.

To test against all supported Python versions, first make sure you have all of the required versions of Python installed. Then simply install and execute tox from the root directory of the repo:

$ pip install tox
$ tox

Even easier is simply creating a pull request. Travis is used for continuous integration, and is set up to run the full test suite whenever a pull request is submitted or updated.

Authors¶

The Internet Archive Python library and command-line tool are written and maintained by Jake Johnson and various contributors:

Development Lead¶

Jake Johnson <jake@archive.org>

Contributors¶

Bryce Drennan <internetarchive@brycedrennan.com>

Patches and Suggestions¶

VM Brasseur
